Chapter 2: Git’s Architecture

15/10/2025


How Git Really Works: A Deep Dive Under the Hood

For most developers, Git is an indispensable tool we use every day. We commit, push, pull, and merge without a second thought. But have you ever stopped to wonder what’s actually happening behind the scenes? Understanding Git’s underlying architecture isn’t just academic—it’s the key to becoming a more proficient and confident developer, capable of solving complex problems with ease.

At the level of its data model, Git does not store a running list of file changes or differences (deltas). Instead, it is something simpler and more powerful: a content-addressable filesystem that records full snapshots. In other words, Git is at its core a key-value store. You give Git a piece of content, and it gives you back a key, a hash derived from that content, which you can use to retrieve the content later. (Packfiles do use delta compression, but only as a storage optimization beneath this model.)
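The key-value idea can be sketched in a few lines of Python. This is a toy model, not Git’s implementation: the ContentStore class and its put/get methods are hypothetical names, and real Git also adds a header to the content and compresses it before storing.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is derived from the value."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        # The key is a hash of the content, so the caller never chooses it.
        key = hashlib.sha1(content).hexdigest()
        self._objects[key] = content  # storing identical content again changes nothing
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ContentStore()
key = store.put(b"hello, git\n")
print(key)                                # 40 hexadecimal characters
print(store.get(key))                     # the original bytes
print(store.put(b"hello, git\n") == key)  # True: same content, same key
```

Because the key depends only on the content, two callers who store the same bytes always agree on the key without coordinating.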

This fundamental design choice is what makes Git so fast, flexible, and robust. Let’s peel back the layers and explore the core components that make it all work.

The Foundation: SHA-1 Hashes and the Object Database

Everything in Git is stored as an “object,” and every object is identified by a SHA-1 hash written as 40 hexadecimal characters. The hash is computed from the object itself: a short header giving the object’s type and size, followed by its content.

This has a critical implication: if the content doesn’t change, the hash doesn’t change. Conversely, even a one-byte modification to a file will produce a completely different hash. This guarantees the integrity of your project’s history. You can be certain that the version of a file from a six-month-old commit is exactly as it was when it was first committed.
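As a minimal sketch of how such an ID is derived for file content, the function below hashes a header of the form "blob <size>" plus a NUL byte, followed by the content. The name git_blob_id is made up for this example; the result should match what git hash-object reports for a file holding the same bytes.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID for a blob holding `content`."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty blob)

# A one-byte difference yields an unrelated-looking ID.
print(git_blob_id(b"hello world\n"))
print(git_blob_id(b"hello world!\n"))
```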

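These objects live in your project’s .git/objects directory, either as individual zlib-compressed “loose” files or bundled into packfiles. This is Git’s object database, the engine room of your repository. There are three main types of objects you need to know; a fourth, the annotated tag object, is not covered here.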
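The sketch below shows roughly how a loose object lands on disk: the header and content are hashed, compressed with zlib, and written to a path built from the first two hex characters of the ID (the directory) and the remaining 38 (the file name). The function names are hypothetical, the example writes to a temporary directory rather than a real repository, and packfiles are ignored entirely.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir: Path, obj_type: str, body: bytes) -> str:
    """Store an object the way a loose object is laid out, and return its ID."""
    store = f"{obj_type} {len(body)}".encode("ascii") + b"\0" + body
    oid = hashlib.sha1(store).hexdigest()
    path = objects_dir / oid[:2] / oid[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():  # same content means same path, so nothing to rewrite
        path.write_bytes(zlib.compress(store))
    return oid


def read_loose_object(objects_dir: Path, oid: str):
    """Return (type, body) for a loose object written by the function above."""
    raw = zlib.decompress((objects_dir / oid[:2] / oid[2:]).read_bytes())
    header, _, body = raw.partition(b"\0")  # split at the first NUL only
    obj_type, size = header.decode("ascii").split(" ")
    assert int(size) == len(body)
    return obj_type, body


with tempfile.TemporaryDirectory() as tmp:
    objects = Path(tmp) / "objects"
    oid = write_loose_object(objects, "blob", b"hello world\n")
    print(oid[:2], oid[2:])                # directory name, file name
    print(read_loose_object(objects, oid))
```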

The Three Core Git Objects

Your entire project history—every file, every directory, and every commit—is built from three simple object types.

1. The Blob (Binary Large Object)

A blob is the most basic object in Git. It stores the raw content of a file, and nothing else.

No Metadata: A blob doesn’t know the file’s name, its creation date, or its location within the project. It is simply a chunk of data.
Content-Driven: Git hashes the file’s content (prefixed with a short header) and stores the result as a blob under that hash. If you have ten identical files in your repository, Git stores only one blob and references it ten times, so duplicated content costs almost nothing extra.

Think of a blob as the “guts” of a file, completely detached from its identity.
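To make the deduplication concrete, here is a small sketch with a hypothetical working tree of eleven files, ten of which share the same bytes. The file names and the git_blob_id helper are invented for illustration; the point is only that the blob ID ignores the file name.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical working tree: ten identical files plus one that differs.
files = {f"copy_{i}.txt": b"same bytes\n" for i in range(10)}
files["other.txt"] = b"different bytes\n"

blobs = {}  # object ID -> content; the file name plays no part in the key
for name, content in files.items():
    blobs[git_blob_id(content)] = content

print(len(files), "files ->", len(blobs), "blobs")  # 11 files -> 2 blobs
```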

2. The Tree

A tree is what gives your files structure. It represents a directory and contains a list of pointers to blobs and other trees.

Directory Mapping: Each entry in a tree object corresponds to a file or subdirectory.
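Pointers and Metadata: For each entry, a tree stores the mode (for example, regular file, executable file, or subdirectory), the filename, and the SHA-1 hash of the blob or tree it points to. The mode is what tells Git whether the entry refers to a blob or a tree.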

This is how Git reconstructs your project’s directory structure for any given point in time. A top-level tree points to blobs (files) and other trees (subdirectories), which in turn point to their own contents, creating a complete snapshot of your project.
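A tree’s on-disk body is a sequence of entries, each made of the mode, a space, the name, a NUL byte, and the raw 20-byte ID of the target object. The sketch below builds a two-level snapshot that way. The helper names and the example files are assumptions made for illustration, and only the common modes are handled.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> bytes:
    """Raw 20-byte SHA-1 ID of an object of the given type."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).digest()


def tree_body(entries) -> bytes:
    """Serialize (mode, name, raw_id) entries into a tree object body."""
    def sort_key(entry):
        mode, name, _ = entry
        # Git orders entries by name, comparing directories as if they ended in "/".
        return name.encode("utf-8") + (b"/" if mode == "40000" else b"")

    body = b""
    for mode, name, raw_id in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0" + raw_id
    return body


# Hypothetical project: README.md at the top level and src/main.py below it.
readme = git_object_id("blob", b"# demo\n")
main_py = git_object_id("blob", b"print('hi')\n")
src = tree_body([("100644", "main.py", main_py)])
root = tree_body([
    ("40000", "src", git_object_id("tree", src)),
    ("100644", "README.md", readme),
])

print(git_object_id("tree", root).hex())  # ID of the top-level snapshot
print(git_object_id("tree", b"").hex())   # 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (empty tree)
```

Changing a single byte in main.py changes its blob ID, which changes the src tree, which in turn changes the root tree, so one top-level ID pins down the entire snapshot.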

3. The Commit

A commit is the object that ties everything together into a cohesive historical timeline. It represents a specific snapshot of your project at a single point in time.

A commit object contains:

A Pointer to a Tree: The SHA-1 hash of the top-level tree object that represents the project’s state for that commit.
Parent Commit(s): The SHA-1 hash of the preceding commit(s). This is what creates the historical chain. A merge commit will have multiple parent hashes.
Author and Committer Information: The name, email, and timestamp for who originally wrote the code and who committed it.
The Commit Message: The descriptive text you write to explain the changes.
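The fields above map directly onto the text of a commit object. The following sketch assembles that text and hashes it; the identity, timestamp, and function names are hypothetical, and extras such as signatures or encoding headers are left out.

```python
import hashlib


def commit_body(tree, parents, author, committer, message) -> bytes:
    """Assemble the text of a commit object from its fields."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # zero, one, or several parents
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


def git_object_id(obj_type: str, body: bytes) -> str:
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
who = "Ada Example <ada@example.com> 1700000000 +0000"  # hypothetical identity

root = commit_body(EMPTY_TREE, [], who, who, "Initial commit")
root_id = git_object_id("commit", root)
child = commit_body(EMPTY_TREE, [root_id], who, who, "Second commit")

print(child.decode("utf-8"))
print(git_object_id("commit", child))
```

Since the parent’s ID is part of the child’s content, rewriting any earlier commit changes the ID of every commit that descends from it.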

When you run git log, Git starts at the commit HEAD points to and walks this chain of commit objects, from the most recent one to its parent(s), and so on.
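That walk can be sketched over a toy commit graph. The IDs and messages below are invented, and the traversal follows only the first parent of each commit (closer to git log --first-parent); real git log visits every parent of a merge and orders its output by commit date.

```python
# Hypothetical commit graph: ID -> (parent IDs, message).
commits = {
    "c3": (["c2"], "Add login form"),
    "c2": (["c1"], "Add README"),
    "c1": ([], "Initial commit"),
}


def log(start):
    """Yield (ID, message) from `start` back to the root, first parent only."""
    current = start
    while current is not None:
        parents, message = commits[current]
        yield current, message
        current = parents[0] if parents else None


for oid, message in log("c3"):
    print(oid, message)
# c3 Add login form
# c2 Add README
# c1 Initial commit
```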

Making It Usable: References and the Staging Area

The object system is powerful, but working directly with 40-character hashes would be impossible. This is where references come in.

References (Branches and Tags): A reference is a simple, human-readable name for a commit hash. When you create a branch like feature/new-login, you are creating a new pointer to a specific commit. A branch in Git is just a lightweight, movable pointer, which is why creating and switching branches is so fast. The special HEAD reference normally points to the branch you are currently on.
The Index (or Staging Area): Before you create a commit, you stage your changes with git add. This command writes a blob object for each staged file into the object database and records the blob’s hash, together with the file’s path and mode, in a binary file called the index (.git/index). The index acts as a staging area, holding the snapshot of content that will become your next commit. When you run git commit, Git builds the necessary tree objects from the index and then creates a commit object that points to the new top-level tree.
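The division of labor between staging and committing can be sketched with two dictionaries standing in for the object database and the index. In real Git, git add writes the blob right away and records its ID in the index, and git commit later turns the index into tree objects plus a commit. Everything here is simplified and hypothetical: the names add, commit, and store are invented, the tree is flat with regular-file modes only, and author and committer lines are omitted.

```python
import hashlib

objects = {}  # object database: ID -> (type, body)
index = {}    # staging area: path -> blob ID


def store(obj_type: str, body: bytes) -> str:
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    oid = hashlib.sha1(header + body).hexdigest()
    objects[oid] = (obj_type, body)
    return oid


def add(path: str, content: bytes) -> None:
    """Like `git add`: write the blob now and record it in the index."""
    index[path] = store("blob", content)


def commit(message: str, parent=None) -> str:
    """Like `git commit`: snapshot the index as a tree, then wrap it in a commit."""
    tree = b"".join(
        b"100644 " + path.encode("utf-8") + b"\0" + bytes.fromhex(oid)
        for path, oid in sorted(index.items())
    )
    tree_id = store("tree", tree)
    lines = [f"tree {tree_id}"] + ([f"parent {parent}"] if parent else [])
    return store("commit", ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8"))


add("README.md", b"# demo\n")
print(len(objects))  # 1: the blob exists before any commit is made
first = commit("Initial commit")
print(sorted(kind for kind, _ in objects.values()))  # ['blob', 'commit', 'tree']
```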

Practical Security and Takeaways

Understanding this architecture provides more than just trivia; it empowers you to use Git more effectively. When you encounter a detached HEAD state, you now know it simply means HEAD is pointing directly to a commit hash instead of a branch reference.
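The difference between an attached and a detached HEAD is visible in the HEAD file itself. The sketch below reads a simplified layout from a temporary directory: resolve_head is a hypothetical helper, the commit ID is a placeholder, and packed refs, unborn branches, and alternative ref storage backends are all ignored.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path):
    """Return (ref name or None, commit ID) for a loose-ref layout."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        ref = head[len("ref: "):]
        return ref, (git_dir / ref).read_text().strip()
    return None, head  # detached: HEAD holds the commit ID itself


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp)
    commit_id = "a" * 40  # placeholder, not a real commit
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "refs" / "heads" / "main").write_text(commit_id + "\n")

    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
    print(resolve_head(git_dir))  # attached: HEAD names a branch

    (git_dir / "HEAD").write_text(commit_id + "\n")
    print(resolve_head(git_dir))  # detached: HEAD names a commit directly
```

In the detached case a new commit still moves HEAD forward, but no branch moves with it, which is why such commits are easy to lose track of after switching away.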

Actionable Tip: The integrity of Git relies on its hashing algorithm. Git has historically used SHA-1, and SHA-1 is still the default in most installations, but recent versions can also create repositories that use the more secure SHA-256 (git init --object-format=sha256). As you manage long-term projects, be aware of this shift away from SHA-1, which is meant to protect against potential hash collision attacks and preserve the future integrity of your repositories.
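To see what the change means for object IDs, the sketch below hashes the same blob with both algorithms. It assumes that SHA-256 repositories keep the same header layout and only swap the hash function; blob_id is a hypothetical helper.

```python
import hashlib


def blob_id(content: bytes, algo: str = "sha1") -> str:
    """Hash a blob with the given algorithm, using the same header for both."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.new(algo, header + content).hexdigest()


content = b"hello world\n"
print(len(blob_id(content, "sha1")))    # 40
print(len(blob_id(content, "sha256")))  # 64
```

The two IDs are unrelated, so a repository uses one object format throughout rather than mixing them.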

By seeing Git as a simple object database with pointers, you can demystify complex commands and gain the confidence to navigate any version control challenge.

Source: https://linuxhandbook.com/courses/git-for-devops/git-architecture/