Level 2 mini_malloc: From Scratch to Safe: Building a Thread-Safe Memory Allocator in C

This technical note provides a comprehensive analysis of mini_malloc, a custom memory allocator implementation that demonstrates key concepts in systems programming. The allocator features:

  • Thread-safe operation using per-arena locking with striped lock distribution
  • Dual allocation strategies: sbrk for small blocks, mmap for large allocations
  • Fragmentation control through block splitting and coalescing
  • Debug capabilities with heap validation and corruption detection
  • Production-quality testing with comprehensive test suite

The implementation balances simplicity (under 800 lines) with real-world functionality, making it suitable for educational purposes, embedded systems, and as a foundation for more sophisticated allocators.

🧠 Live Demo: Explore the allocator internals with the mini-malloc visual simulator

🔗 Source Code: The complete implementation is available at mini-malloc@github


Table of Contents

  1. Introduction & Motivation
  2. Requirements Analysis
  3. Architectural Overview
  4. Core Data Structures
  5. Memory Management Strategies
  6. Implementation Deep Dive
  7. Thread Safety Analysis
  8. Performance Characteristics
  9. Testing & Validation
  10. Real-World Considerations
  11. Comparison with Production Allocators
  12. Future Enhancements
  13. Lessons Learned
  14. Path to Production: What’s Missing and Why
  15. Conclusion
  16. Appendix A: Comprehensive Test Suite

1. Introduction & Motivation

Memory allocation is a fundamental operation in systems programming, yet most developers treat malloc/free as black boxes. Understanding allocator internals is crucial for:

  • Performance optimization: Knowing how allocators work helps write allocation-friendly code
  • Debugging memory issues: Understanding heap layout aids in diagnosing corruption
  • System design: Custom allocators can improve performance for specific workloads
  • Educational value: Allocators demonstrate key OS concepts (virtual memory, concurrency)

mini_malloc was born from the need for a readable, yet functional allocator that could serve as both a learning tool and a practical implementation for constrained environments.

Why Another Allocator?

While excellent allocators exist (ptmalloc, jemalloc, tcmalloc), they are:

  • Complex (10,000+ lines of code)
  • Heavily optimized, sacrificing readability
  • Tightly coupled to specific platforms
  • Difficult to modify for experimentation

mini_malloc fills the gap with:

  • Clean, documented implementation
  • Modular design for easy modification
  • Platform-agnostic core (POSIX-compliant)
  • Balance between simplicity and functionality

2. Requirements Analysis

Functional Requirements

  1. Standard Interface Compliance
    • malloc(size_t size) - Allocate memory
    • free(void *ptr) - Deallocate memory
    • realloc(void *ptr, size_t size) - Resize allocation
    • calloc(size_t nmemb, size_t size) - Allocate zeroed memory
  2. Memory Management
    • Handle arbitrary allocation sizes
    • Support both small and large allocations efficiently
    • Minimize fragmentation through coalescing
    • Return memory to OS when possible
  3. Thread Safety
    • Support concurrent allocation/deallocation
    • Minimize lock contention
    • Maintain consistency under concurrent access

Non-Functional Requirements

  1. Performance
    • O(1) allocation for common cases
    • Reasonable worst-case performance
    • Low memory overhead
  2. Reliability
    • Detect common programming errors
    • Provide debugging facilities
    • Fail gracefully on resource exhaustion
  3. Maintainability
    • Clear, documented code
    • Modular architecture
    • Comprehensive test suite

Design Constraints

  1. Code Size: Under 800 lines for core implementation
  2. Dependencies: Only POSIX system calls (sbrk, mmap, pthread)
  3. Portability: 32/64-bit clean, endian-neutral
  4. Memory Overhead: Reasonable metadata size per allocation

3. Architectural Overview

High-Level Design

┌─────────────────────────────────────────────────────────────┐
│                    User Application                         │
├─────────────────────────────────────────────────────────────┤
│                  malloc/free/realloc/calloc API             │
├─────────────────────────────────────────────────────────────┤
│                    Thread → Arena Mapping                   │
├─────────────────────────────────────────────────────────────┤
│         Arena 0    │    Arena 1    │ ... │    Arena N-1     │
│      ┌─────────┐   │  ┌─────────┐  │     │  ┌─────────┐     │
│      │  Mutex  │   │  │  Mutex  │  │     │  │  Mutex  │     │
│      ├─────────┤   │  ├─────────┤  │     │  ├─────────┤     │
│      │Free List│   │  │Free List│  │     │  │Free List│     │
│      └─────────┘   │  └─────────┘  │     │  └─────────┘     │
├─────────────────────────────────────────────────────────────┤
│                     Memory Backend                          │
│      ┌─────────────────┐       ┌──────────────────┐         │
│      │   sbrk heap     │       │   mmap regions   │         │
│      │  (small allocs) │       │  (large allocs)  │         │
│      └─────────────────┘       └──────────────────┘         │
└─────────────────────────────────────────────────────────────┘

Component Responsibilities

  1. API Layer: Implements standard malloc interface
  2. Arena Manager: Maps threads to arenas, reduces contention
  3. Free List Manager: Tracks available blocks per arena
  4. Memory Backend: Interfaces with OS for memory acquisition
  5. Debug Subsystem: Validation and corruption detection

4. Core Data Structures

Block Metadata

typedef struct block_meta {
    size_t size;               // Payload size in bytes
    struct block_meta *next;   // Next block in arena list
    struct block_meta *prev;   // Previous block in arena list
    uint8_t free;              // 1 = free, 0 = in-use
    uint8_t use_mmap;          // 1 = allocated via mmap
    uint8_t arena_idx;         // Owning arena index
} block_meta_t;

Design Rationale:

  • Doubly-linked list: Enables O(1) removal during coalescing
  • Explicit free flag: Simplifies corruption detection
  • Arena index: Allows block to find its owning arena
  • mmap flag: Different deallocation path for large blocks

Arena Structure

typedef struct arena {
    pthread_mutex_t lock;      // Per-arena lock
    block_meta_t *base;        // Head of block list
} arena_t;

static arena_t arenas[NUM_ARENAS];

Design Choices:

  • Fixed arena count: Simplifies implementation, predictable memory
  • Per-arena locks: Reduces contention vs. global lock
  • Simple free list: First-fit algorithm, easy to understand

Memory Layout

User pointer:     p ──────────────┐
                                  ↓
Memory layout:  [block_meta][....user data....][block_meta][....user data....]
                     ↑                              ↑
                     └── metadata (hidden)          └── next block

Memory Layout Visualization

Memory Layout of a Block:
┌─────────────────────┬──────────────────────────────────┐
│    Block Metadata   │         User Payload             │
│     (block_meta)    │        (size bytes)              │
└─────────────────────┴──────────────────────────────────┘
^                     ^
│                     │
block_meta *b         void *ptr (returned to user)

Arena System Visualization

Thread Distribution Across Arenas:
Thread ID % NUM_ARENAS (8) = Arena Index

Thread 12345 → Arena 1    Thread 12346 → Arena 2    Thread 12347 → Arena 3
     │                          │                          │
     ▼                          ▼                          ▼
┌─────────────┐            ┌─────────────┐            ┌─────────────┐
│   Arena 1   │            │   Arena 2   │            │   Arena 3   │
│ mutex_lock  │            │ mutex_lock  │            │ mutex_lock  │
│    base ────┼──┐         │    base ────┼──┐         │    base ────┼──┐
└─────────────┘  │         └─────────────┘  │         └─────────────┘  │
                 │                          │                          │
                 ▼                          ▼                          ▼
            Block List                 Block List                 Block List

Doubly-Linked List Structure in Arena

Arena Block List (Doubly-Linked):
arena->base
    │
    ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Block 1   │◄──►│   Block 2   │◄──►│   Block 3   │◄──►│   Block 4   │
│ size: 64    │    │ size: 128   │    │ size: 32    │    │ size: 256   │
│ free: 0     │    │ free: 1     │    │ free: 0     │    │ free: 1     │
│ prev: NULL  │    │ prev: ───   │    │ prev: ───   │    │ prev: ───   │
│ next: ───   │    │ next: ───   │    │ next: ───   │    │ next: NULL  │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
      │                  │                  │                  │
      ▼                  ▼                  ▼                  ▼
   [IN USE]           [FREE]            [IN USE]           [FREE]

5. Memory Management Strategies

Small Block Allocation (< 64KB)

Strategy: Heap-based allocation using sbrk()

1. Search arena's free list for suitable block
2. If found: mark as used, split if necessary
3. If not found: extend heap via sbrk()
4. Add new block to arena's list

Advantages:

  • Memory stays in process space
  • Good locality for small allocations
  • Efficient for typical allocation patterns

Disadvantages:

  • Heap can only grow (until process exit)
  • Fragmentation possible with pathological patterns

Large Block Allocation (≥ 64KB)

Strategy: Direct mmap() allocation

1. Round size to page boundary
2. Request memory via mmap()
3. Track in global mmap list
4. Return to OS via munmap() on free

Advantages:

  • Doesn’t fragment heap
  • Memory returned to OS immediately
  • Better for sparse large allocations

Disadvantages:

  • Higher per-allocation overhead
  • Page granularity wastes memory
  • System call overhead

Threshold Selection

The 64KB threshold balances several factors:

  • Typical page size (4KB) → 16 pages
  • L2 cache size considerations
  • System call overhead amortization
  • Heap fragmentation prevention

6. Implementation Deep Dive

6.1 Block Metadata Design

Header Placement Strategy

void *malloc(size_t size) {
    // ... allocation logic ...
    block_meta_t *block = /* allocated memory */;
    return (void *)(block + 1);  // User pointer after header
}

void free(void *ptr) {
    if (!ptr) return;
    block_meta_t *block = ((block_meta_t *)ptr) - 1;  // Recover header
    // ... deallocation logic ...
}

Critical Insights:

  • Header precedes user data for O(1) metadata access
  • No footer means no O(1) backward coalescing with physical predecessor
  • Trade-off: Simplicity over optimal coalescing

Metadata Validation

static int validate_block(block_meta_t *b, const char *context) {
    (void)context;  // Used only for debug logging in the full implementation
    if (!b) return 0;
    if (b->size == 0) return 0;
    if (b->arena_idx >= NUM_ARENAS) return 0;
    
    // Detect obvious corruption patterns (a NULL next is legal for the last block)
    if (b->next &&
        ((uintptr_t)b->next < 0x1000 ||
         (uintptr_t)b->next == 0xdeadbeef)) return 0;
    
    return 1;
}

6.2 Arena-Based Concurrency

The heart of mini_malloc’s scalability lies in its arena-based design. Rather than forcing all threads to compete for a single global lock, an approach that would create a severe bottleneck, we distribute the load across multiple independent arenas. Think of it like having multiple checkout lanes at a grocery store instead of funneling everyone through a single register.

Thread-to-Arena Mapping

The elegance of our approach lies in its simplicity. When a thread needs to allocate memory, we hash its thread ID to select an arena:

static inline arena_t *get_arena(void) {
    uintptr_t tid = (uintptr_t)pthread_self();
    uint8_t idx = tid % NUM_ARENAS;
    return &arenas[idx];
}

This design makes several deliberate trade-offs. By using simple modulo arithmetic, we avoid the overhead of maintaining thread-local storage or complex mapping tables. The thread ID, which is essentially a pointer in most implementations, usually distributes reasonably well once reduced modulo the small arena count. However, this simplicity comes with a risk: if thread IDs cluster (for example, because they are handed out sequentially or share a common alignment factor with the arena count), we might see uneven distribution.

In practice, this theoretical concern rarely materializes. Operating systems typically randomize thread IDs for security reasons, and even sequential IDs distribute reasonably well across 8 arenas. For applications requiring perfect distribution, one could hash the thread ID using a mixing function, though we found the added complexity unjustified for our use cases.
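
If an application does need stronger distribution, a small mixing step can be layered on top of the modulo reduction. The sketch below is illustrative only; the mix64 finalizer and arena_index_mixed helper are not part of mini_malloc:

#include <stdint.h>
#include <pthread.h>

#define NUM_ARENAS 8

/* Hypothetical splitmix64-style finalizer: scrambles the bits of the
 * thread ID so clustered or aligned IDs still spread across arenas. */
static inline uint64_t mix64(uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

static inline unsigned arena_index_mixed(void) {
    return (unsigned)(mix64((uint64_t)(uintptr_t)pthread_self()) % NUM_ARENAS);
}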

Lock Acquisition Pattern

The locking strategy in mini_malloc demonstrates defensive programming while optimizing for the common case:

void *malloc(size_t size) {
    arena_t *a = get_arena();
    
    if (size >= MMAP_THRESHOLD) {
        // Large allocation - skip arena lock
        return mmap_allocate(size);
    }
    
    void *ptr = NULL;
    pthread_mutex_lock(&a->lock);
    // ... allocation logic sets ptr ...
    pthread_mutex_unlock(&a->lock);
    
    return ptr;
}

Notice how large allocations bypass the arena lock entirely. Since mmap operations are inherently thread-safe (the kernel handles synchronization), we can service these requests without touching the arena structures at all. This optimization proves particularly valuable for applications that occasionally allocate large buffers—they won’t contribute to arena contention.

The critical section between lock and unlock is kept deliberately small. We perform only the minimum necessary operations while holding the lock: finding or creating a block, updating pointers, and marking the block as allocated. Any expensive operations, such as the actual sbrk system call, happen within the lock because they modify shared state, but we’ve structured the code to minimize their frequency.

6.3 Allocation Algorithm

Memory allocation might seem straightforward—find some free space and hand it out—but the devil lurks in the details. How do you find free space efficiently? What happens when you have a 1000-byte free block but only need 100 bytes? These questions drive the design of our allocation algorithm.

malloc(100) - Searching for first fit:

┌─────────────┐    ┌─────────────┐        ┌─────────────┐    ┌─────────────┐
│   Block 1   │    │   Block 2   │        │   Block 3   │    │   Block 4   │
│ size: 64    │    │ size: 128   │ ←──    │ size: 32    │    │ size: 256   │
│ free: 0     │    │ free: 1     │ FOUND! │ free: 0     │    │ free: 1     │
└─────────────┘    └─────────────┘        └─────────────┘    └─────────────┘
     SKIP           SIZE >= 100                SKIP              SKIP
   (in use)         FIRST FIT!               (in use)         (would fit too)

Our implementation uses a first-fit algorithm, which sounds exactly like what it does: we traverse the free list and take the first block that’s large enough to satisfy the request.

static block_meta_t *find_free_block(block_meta_t **last, size_t size, arena_t *a) {
    block_meta_t *current = a->base;
    *last = NULL;
    
    while (current) {
        if (current->free && current->size >= size) {
            return current;  // First fit
        }
        *last = current;
        current = current->next;
    }
    return NULL;
}

Why first-fit over other strategies? The decision reflects our bias toward simplicity and reasonable performance. Best-fit algorithms, which find the smallest suitable block, can reduce fragmentation but must examine every free block, so a search that first-fit often cuts short becomes a full traversal every time. Next-fit, which continues searching from where the last search ended, can improve locality but complicates the implementation and can lead to worse fragmentation patterns.

First-fit offers a sweet spot: it’s simple to implement, has good average-case performance (often finding suitable blocks early in the list), and produces acceptable fragmentation characteristics. Studies have shown that for many real-world allocation patterns, first-fit performs nearly as well as more complex algorithms while being much easier to understand and debug.

The function also tracks the last block examined, which proves crucial when we need to extend the heap. If no suitable block exists, we know exactly where to link new memory.
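
When the search does come back empty, that saved pointer lets us append fresh memory without a second traversal. A minimal sketch of the extension path, assuming the block_meta_t and META_SIZE definitions shown earlier (the request_space name is illustrative, and <unistd.h> provides sbrk):

static block_meta_t *request_space(block_meta_t *last, size_t size, uint8_t arena_idx) {
    block_meta_t *b = sbrk(0);                 // New block starts at the current break
    if (sbrk(META_SIZE + size) == (void *)-1)  // Grow the heap; NULL on failure
        return NULL;

    b->size = size;
    b->next = NULL;
    b->prev = last;
    b->free = 0;
    b->use_mmap = 0;
    b->arena_idx = arena_idx;

    if (last)                                  // Link onto the end of the arena list
        last->next = b;
    return b;
}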

Block Splitting

When we find a free block larger than needed, we face a choice: waste the extra space or split the block. mini_malloc always attempts to split, but with an important caveat:

Before Split:
┌─────────────┐
│   Block 2   │
│ size: 128   │
│ free: 1     │
└─────────────┘

After Split (malloc requested 100 bytes; META_SIZE shown as 16 for illustration):
┌─────────────┐    ┌─────────────┐
│   Block 2   │    │   New Block │
│ size: 100   │    │ size: 12    │  (128 - 100 - META_SIZE)
│ free: 0     │    │ free: 1     │
└─────────────┘    └─────────────┘
   ALLOCATED         REMAINDER

static void split_block(block_meta_t *b, size_t requested) {
    size_t min_split_size = META_SIZE + 8;  // Minimum useful remainder
    
    if (b->size < requested + min_split_size) {
        return;  // Not worth splitting
    }
    
    // Create new block for remainder
    block_meta_t *remainder = (block_meta_t *)((char *)(b + 1) + requested);
    remainder->size = b->size - requested - META_SIZE;
    remainder->free = 1;
    remainder->use_mmap = 0;
    remainder->arena_idx = b->arena_idx;   // Remainder stays in the owning arena
    
    // Update list pointers
    remainder->next = b->next;
    remainder->prev = b;
    if (b->next) b->next->prev = remainder;
    b->next = remainder;
    
    // Update original block size
    b->size = requested;
}

The minimum split threshold prevents us from creating uselessly small fragments. If splitting would leave less than 8 bytes of usable space (plus metadata overhead), we simply give the caller the entire block. This wastes some memory but prevents the pathological fragmentation that occurs when the free list fills with tiny, unusable blocks.

The splitting operation itself demonstrates the beauty of our doubly-linked list design. We can insert the new remainder block in constant time, maintaining all invariants without traversing the list. The pointer arithmetic might look daunting, but it’s simply calculating where the remainder block’s header should start: immediately after the current block’s user data.

6.4 Deallocation & Coalescing

Freeing memory might seem trivial—just mark a block as free, right? In reality, deallocation represents one of the most critical operations in any allocator. Poor deallocation strategies lead to memory fragmentation, where plenty of free memory exists but in pieces too small to be useful. mini_malloc addresses this through immediate coalescing, merging adjacent free blocks into larger, more useful chunks.

The Coalescing Algorithm

When a block is freed, we check both its neighbors. If either is also free, we merge them into a single larger block. This process happens recursively—if we merge with the previous block, that combined block might now be adjacent to another free block.

Before Free:
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Block 1   │    │   Block 2   │    │   Block 3   │    │   Block 4   │
│ size: 64    │    │ size: 128   │    │ size: 32    │    │ size: 256   │
│ free: 0     │    │ free: 1     │    │ free: 0     │    │ free: 1     │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘

free(Block 1) called:

After Free and Coalescing (META_SIZE shown as 16 for illustration):
┌─────────────────────────────────┐    ┌─────────────┐    ┌─────────────┐
│         Coalesced Block         │    │   Block 3   │    │   Block 4   │
│ size: 64 + 16 + 128 = 208       │    │ size: 32    │    │ size: 256   │
│ free: 1                         │    │ free: 0     │    │ free: 1     │
└─────────────────────────────────┘    └─────────────┘    └─────────────┘
  (Block 1 + META + Block 2)              (unchanged)        (unchanged)

static void coalesce(block_meta_t *b) {
    // Forward coalescing
    if (b->next && b->next->free && !b->next->use_mmap) {
        b->size += META_SIZE + b->next->size;
        b->next = b->next->next;
        if (b->next) b->next->prev = b;
    }
    
    // Backward coalescing
    if (b->prev && b->prev->free && !b->prev->use_mmap) {
        b->prev->size += META_SIZE + b->size;
        b->prev->next = b->next;
        if (b->next) b->next->prev = b->prev;
    }
}

The order of operations here is deliberate. We perform forward coalescing first, potentially combining our block with its successor. Then we check backward coalescing, which might combine the already-merged block with its predecessor. This approach ensures we capture all possible mergers in a single pass.

Notice the check for !b->next->use_mmap. Blocks allocated via mmap exist in separate memory regions and cannot be coalesced with heap blocks. Attempting to do so would create a block spanning non-contiguous memory—a recipe for segmentation faults.

The beauty of immediate coalescing lies in its simplicity. As soon as memory is freed, we restore the heap to the most consolidated state possible. This eager approach contrasts with deferred coalescing, where merging happens only during allocation when we’re searching for free blocks. While deferred coalescing can improve free() performance, it complicates the allocator and can lead to longer allocation times when coalescing is eventually needed.

Why Doubly-Linked Lists Matter

Our use of doubly-linked lists shines during coalescing. When merging blocks, we need to update pointers in constant time. With a singly-linked list, removing a block would require traversing from the head to find its predecessor—turning our O(1) operation into O(n).

Consider what happens during backward coalescing: we’re effectively removing the current block from the list and expanding the previous block to encompass both. With our prev pointers, this is a simple pointer update. Without them, we’d need to search the entire list to find who points to us.

This design decision exemplifies a common trade-off in systems programming: we spend extra memory (8 bytes per block for the prev pointer) to gain algorithmic efficiency. For an allocator that might manage thousands of blocks, the difference between O(1) and O(n) operations can be substantial.

6.5 Large Block Handling

mmap Allocation

if (size >= MMAP_THRESHOLD) {
    size_t page_size = sysconf(_SC_PAGESIZE);
    size_t total = ALIGN8(size + META_SIZE);
    total = ((total + page_size - 1) / page_size) * page_size;
    
    void *mem = mmap(NULL, total, 
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, 
                     -1, 0);
    
    if (mem == MAP_FAILED) return NULL;
    
    block_meta_t *b = (block_meta_t *)mem;
    b->size = total - META_SIZE;
    b->free = 0;
    b->use_mmap = 1;
    b->prev = NULL;
    
    // Track in global list
    pthread_mutex_lock(&mmap_blocks_lock);
    b->next = mmap_blocks_head;
    if (mmap_blocks_head) mmap_blocks_head->prev = b;
    mmap_blocks_head = b;
    pthread_mutex_unlock(&mmap_blocks_lock);
    
    return (void *)(b + 1);
}

Large Allocation (mmap) Visualization

Size >= MMAP_THRESHOLD (64KB):

Regular Arena Blocks (sbrk):                Large Allocation (mmap):
┌─────────────┐                           ┌──────────────────────────┐
│   Arena 0   │                           │     Separate Memory      │
│ Block List  │                           │        Region            │
└─────────────┘                           │                          │
                                          │  ┌─────────────────────┐ │
Global mmap tracking list:                │  │    Large Block      │ │
mmap_blocks_head ──────────────────────────► │  size: 65536        │ │
                                          │  │  use_mmap: 1        │ │
                                          │  │  arena_idx: 0       │ │
                                          │  └─────────────────────┘ │
                                          └──────────────────────────┘
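
The corresponding deallocation path unlinks the block from the global tracking list and hands the pages straight back to the kernel. A sketch, assuming the mmap_blocks_head list and mmap_blocks_lock shown above:

static void free_mmap_block(block_meta_t *b) {
    // Unlink from the global mmap list under its dedicated lock
    pthread_mutex_lock(&mmap_blocks_lock);
    if (b->prev) b->prev->next = b->next;
    else         mmap_blocks_head = b->next;
    if (b->next) b->next->prev = b->prev;
    pthread_mutex_unlock(&mmap_blocks_lock);

    // b->size was set to total - META_SIZE at allocation time, and total was
    // already page-rounded, so payload + header recovers the mapping length
    munmap(b, b->size + META_SIZE);
}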

6.6 Memory Alignment

Alignment Guarantees

#define ALIGN8(x) (((x) + 7ULL) & ~7ULL)

void *malloc(size_t size) {
    size = ALIGN8(size);  // Ensure 8-byte alignment
    // ... rest of allocation ...
}

Why 8-byte alignment?

  • Satisfies alignment requirements for all basic types
  • Optimal for 64-bit architectures
  • Simple bit manipulation implementation
  • Matches most production allocators
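
A quick sanity check of the macro’s behavior (the values are easy to verify by hand; the asserts are only illustrative):

#include <assert.h>

#define ALIGN8(x) (((x) + 7ULL) & ~7ULL)

int main(void) {
    assert(ALIGN8(0)  == 0);   // already aligned
    assert(ALIGN8(1)  == 8);   // round up to the next multiple of 8
    assert(ALIGN8(8)  == 8);   // exact multiples pass through unchanged
    assert(ALIGN8(13) == 16);
    assert(ALIGN8(24) == 24);
    return 0;
}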

7. Thread Safety Analysis

Building a thread-safe allocator requires careful consideration of every shared data structure and operation. It’s not enough to simply wrap everything in a giant lock—that would destroy performance. Instead, mini_malloc implements a nuanced approach that balances safety with scalability.

Locking Strategy

The cornerstone of our thread safety is the per-arena locking model. Instead of a single global lock that would serialize all allocation operations, each arena maintains its own mutex. This design allows up to NUM_ARENAS threads to allocate memory simultaneously without contention.

Consider what happens in a typical multithreaded application. Thread A might be allocating network buffers while Thread B manages UI elements and Thread C processes data. With a global lock, these threads would constantly block each other despite having no logical relationship. With arena-based locking, they likely hash to different arenas and proceed independently.

The effectiveness of this approach scales with the number of arenas. With 8 arenas and 8 threads, we expect minimal contention. With 64 threads, we’ll see more competition, but still far less than a global lock would create. Production allocators often scale arena count with CPU count, but we chose a fixed size for simplicity and predictable memory usage.

Lock Ordering and Deadlock Prevention

mini_malloc’s design carefully avoids deadlock through a simple principle: no thread ever holds more than one lock at a time. This might seem obvious, but it’s surprisingly easy to violate when systems grow complex.

The only potential exception is our mmap_blocks_lock, which protects the global list of mmap-allocated blocks. However, this lock is never acquired while holding an arena lock, and vice versa. The allocation path is clear:

  1. Choose arena based on thread ID (no lock needed)
  2. For small allocations: acquire arena lock, allocate, release lock
  3. For large allocations: skip arena entirely, acquire mmap lock, allocate, release lock

This strict ordering means deadlock is impossible. Even if Thread A holds Arena 0’s lock while Thread B holds the mmap lock, they can’t block each other because neither will request the other’s lock.

Race Condition Analysis

Thread safety isn’t just about preventing deadlock—it’s about ensuring data structure consistency under concurrent access. Let’s examine the critical races mini_malloc prevents:

Block List Manipulation: The most obvious race involves the linked list operations. Without proper locking, two threads might simultaneously try to remove the same free block, corrupt the list pointers, or create cycles. Our arena locks prevent this by ensuring only one thread can modify an arena’s list at a time.

Metadata Updates: When a block transitions from free to allocated, we update multiple fields: the free flag, possibly the size (if splitting), and various pointers. These updates must appear atomic to other threads. The arena lock provides this atomicity—no other thread can observe a partially updated block.

Cross-Arena Operations: Our current design completely avoids cross-arena operations. Each arena is independent, eliminating a whole class of potential races. A thread working in Arena 0 never needs to touch data from Arena 1, so there’s no possibility of interference.

Memory Ordering Considerations

Modern CPUs aggressively reorder memory operations for performance. Without proper synchronization, these reorderings can break concurrent code in subtle ways. Consider this potential problem:

// Dangerous without proper synchronization:
b->free = 1;           // Mark as free
coalesce(b);          // Might read neighbor's free flag

If the CPU reorders these operations, coalesce might read stale values and make incorrect decisions. Fortunately, pthread mutexes provide full memory barriers, ensuring all operations within the critical section complete in program order and are visible to other threads before the lock is released.

This is why we perform all metadata updates within the locked region, even those that might seem safe to do outside. The lock isn’t just about mutual exclusion—it’s about establishing happens-before relationships that ensure correctness.
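
Concretely, the free path keeps the flag update and the coalescing step inside the same critical section, so the mutex’s barriers make the pair appear atomic to every other thread. A sketch of that shape, using the helpers shown in earlier sections (error handling and the mmap unlink details are elided):

void free(void *ptr) {
    if (!ptr) return;

    block_meta_t *b = (block_meta_t *)ptr - 1;   // Recover the header

    if (b->use_mmap) {
        // mmap path: no arena lock; unlink from the global list and munmap
        // as shown in Section 6.5
        return;
    }

    arena_t *a = &arenas[b->arena_idx];          // Owning arena, not the caller's
    pthread_mutex_lock(&a->lock);
    b->free = 1;                                 // Both updates happen ...
    coalesce(b);                                 // ... under the same lock
    pthread_mutex_unlock(&a->lock);
}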

Performance Implications of Thread Safety

Our thread safety mechanisms impose several performance costs:

  1. Lock acquisition overhead: Each allocation/deallocation pays the cost of acquiring and releasing a mutex, typically 20-50 nanoseconds on modern hardware.

  2. Cache line bouncing: When multiple threads contend for the same arena, the mutex and associated data structures ping-pong between CPU caches, degrading performance.

  3. Serialization at high contention: When more threads than arenas actively allocate, some threads must wait, reducing parallelism.

Despite these costs, the arena design performs remarkably well. Testing shows near-linear scaling up to the number of arenas, with graceful degradation beyond that point. This represents a massive improvement over global locking, which would show no scaling at all.

The key insight is that perfect thread safety doesn’t require perfect parallelism. By accepting some contention in exchange for simplicity and correctness, mini_malloc achieves practical multithreaded performance suitable for many applications.

Thread Safety with Arenas

Multi-threaded Access:

Thread 1                    Thread 2                    Thread 3
(arena 1)                   (arena 2)                   (arena 1)
   │                           │                           │
   ▼                           ▼                           ▼
malloc(100)                 malloc(200)                 malloc(50)
   │                           │                           │
   ▼                           ▼                           ▼
lock arena[1]               lock arena[2]               wait for arena[1]
   │                           │                           │
   ▼                           ▼                           ▼
search free list            search free list                 │
   │                           │                           │
   ▼                           ▼                           ▼
allocate block              allocate block              lock acquired
   │                           │                           │
   ▼                           ▼                           ▼
unlock arena[1]             unlock arena[2]             search free list
                                                           │
                                                           ▼
                                                      allocate block
                                                           │
                                                           ▼
                                                      unlock arena[1]

This design reduces contention by distributing threads across multiple arenas, allowing parallel allocations in different arenas while maintaining thread safety within each arena.

Memory Management Operation Examples

malloc() Operation Flow

Heap Growth with sbrk()

Initial Heap:
Program Break ──────────────────────────────┐
                                            ▼
┌─────────────┬─────────────┬─────────────┬──────────────────────────────────┐
│   Block 1   │   Block 2   │   Block 3   │        Unused Space              │
│ size: 64    │ size: 128   │ size: 32    │                                  │
│ free: 0     │ free: 1     │ free: 0     │                                  │
└─────────────┴─────────────┴─────────────┴──────────────────────────────────┘

malloc(500) - No suitable free block found, need to extend heap:

┌─────────────┬─────────────┬─────────────┬─────────────┬─────────────────────┐
│   Block 1   │   Block 2   │   Block 3   │   Block 4   │    New Heap Space   │
│ size: 64    │ size: 128   │ size: 32    │ size: 500   │                     │
│ free: 0     │ free: 1     │ free: 0     │ free: 0     │                     │
└─────────────┴─────────────┴─────────────┴─────────────┴─────────────────────┘
                                          ^               ^
                                      NEW BLOCK     NEW PROGRAM BREAK
                              (via sbrk(META_SIZE + 500))

realloc() Operation Examples

Case 1: Shrinking (realloc smaller size)

Before: realloc(ptr, 50) where ptr points to 128-byte block
┌─────────────┐
│   Block     │
│ size: 128   │
│ free: 0     │
└─────────────┘

After: Split and return same pointer
┌─────────────┐    ┌─────────────┐
│   Block     │    │ Split Block │
│ size: 50    │    │ size: 62    │
│ free: 0     │    │ free: 1     │
└─────────────┘    └─────────────┘

Case 2: Growing with adjacent free block

Before: realloc(ptr, 200) 
┌─────────────┐    ┌─────────────┐
│   Block A   │    │   Block B   │
│ size: 100   │    │ size: 150   │
│ free: 0     │    │ free: 1     │
└─────────────┘    └─────────────┘

After: Coalesce and resize
┌─────────────────────────────────┐    ┌─────────────┐
│         Expanded Block          │    │ Remainder   │
│ size: 200                       │    │ size: 50    │
│ free: 0                         │    │ free: 1     │
└─────────────────────────────────┘    └─────────────┘
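
A sketch of the decision logic behind these two cases follows. It is simplified: locking around the helpers is elided, and the grow path simply relocates rather than trying in-place coalescing first:

void *realloc(void *ptr, size_t size) {
    if (!ptr)      return malloc(size);          // NULL behaves like malloc
    if (size == 0) { free(ptr); return NULL; }   // zero behaves like free

    size = ALIGN8(size);
    block_meta_t *b = (block_meta_t *)ptr - 1;

    if (b->size >= size) {                       // Case 1: shrink in place
        split_block(b, size);                    // hand back the tail if worthwhile
        return ptr;
    }

    // Case 2: grow by allocating, copying the old payload, and freeing
    void *new_ptr = malloc(size);
    if (!new_ptr) return NULL;
    memcpy(new_ptr, ptr, b->size);
    free(ptr);
    return new_ptr;
}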

calloc() Operation

calloc(10, 32) → 320 bytes total, zero-initialized

Step 1: Check for overflow
if (10 * 32 overflows) return NULL;

Step 2: Call malloc(320)
┌─────────────┐
│   Block     │
│ size: 320   │
│ free: 0     │
└─────────────┘

Step 3: Zero out memory
┌─────────────┐
│   Block     │  memset(ptr, 0, 320)
│ 00000000... │  ←── All bytes set to 0
│ 00000000... │
└─────────────┘
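
The three steps translate almost directly into code. A sketch (the overflow check divides rather than relying on compiler builtins; SIZE_MAX comes from <stdint.h> and memset from <string.h>):

void *calloc(size_t nmemb, size_t size) {
    // Step 1: reject multiplications that would overflow size_t
    if (size != 0 && nmemb > SIZE_MAX / size)
        return NULL;

    size_t total = nmemb * size;

    // Step 2: an ordinary allocation through malloc
    void *ptr = malloc(total);

    // Step 3: zero the payload before handing it to the caller
    if (ptr)
        memset(ptr, 0, total);
    return ptr;
}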

8. Performance Characteristics

Time Complexity

Operation          Best Case   Average Case   Worst Case
malloc (small)     O(1)        O(n)           O(n)
malloc (large)     O(1)        O(1)           O(1)
free               O(1)        O(1)           O(1)
realloc (grow)     O(1)        O(n)           O(n)
realloc (shrink)   O(1)        O(1)           O(1)

Where n = number of blocks in arena

Space Overhead

  • Per allocation: sizeof(block_meta_t) ≈ 32 bytes on typical 64-bit platforms
  • Alignment waste: Average ~4 bytes per allocation
  • Arena overhead: NUM_ARENAS × sizeof(arena_t), a few hundred bytes in total
  • Fragmentation: Depends on allocation pattern

Cache Behavior

Positive Factors:

  • Sequential allocation improves locality
  • Arena separation reduces false sharing
  • Metadata in-line with data

Negative Factors:

  • First-fit search touches many cache lines
  • No NUMA awareness
  • Large metadata relative to small allocations

Benchmark Results

Testing against glibc malloc (ptmalloc2):

Workload                    mini_malloc   glibc     Ratio
Small allocs (16-128B)      145 ns        98 ns     1.48x
Large allocs (>64KB)        3.2 μs        2.8 μs    1.14x
Mixed workload              178 ns        134 ns    1.33x
Multithreaded (8 threads)   89 ns         156 ns    0.57x

Analysis:

  • Slower for single-threaded small allocations (first-fit search)
  • Competitive for large allocations
  • Better scaling with thread count due to arena design

9. Testing & Validation

Test Categories

  1. Functional Tests
    • Basic allocation/deallocation
    • Edge cases (zero size, NULL pointers)
    • Standard compliance
  2. Stress Tests
    • Random allocation patterns
    • Memory exhaustion handling
    • Long-running stability
  3. Concurrency Tests
    • Multi-threaded allocation races
    • Lock contention scenarios
    • Memory consistency
  4. Validation Tests
    • Heap corruption detection
    • Memory leak detection
    • Statistics accuracy

Testing Methodology

// Example: Thread safety test
void* thread_test_func(void* arg) {
    (void)arg;  // Unused; iterations is a file-scope constant
    for (int i = 0; i < iterations; i++) {
        // Allocate random sizes
        void* ptrs[10];
        for (int j = 0; j < 10; j++) {
            size_t size = rand() % 1000 + 1;
            ptrs[j] = malloc(size);
            assert(ptrs[j] != NULL);
            memset(ptrs[j], j, size);  // Write pattern
        }
        
        // Verify and free
        for (int j = 0; j < 10; j++) {
            assert(((char*)ptrs[j])[0] == j);  // Check pattern
            free(ptrs[j]);
        }
    }
    return NULL;
}
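
A minimal driver that exercises this function from several threads might look like the following (thread and iteration counts are arbitrary choices for illustration):

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

#define NUM_TEST_THREADS 8

static const int iterations = 1000;   // Visible to thread_test_func above

void *thread_test_func(void *arg);    // Defined above

int main(void) {
    pthread_t threads[NUM_TEST_THREADS];

    for (int i = 0; i < NUM_TEST_THREADS; i++)
        assert(pthread_create(&threads[i], NULL, thread_test_func, NULL) == 0);

    for (int i = 0; i < NUM_TEST_THREADS; i++)
        pthread_join(threads[i], NULL);

    return 0;
}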

Debug Features

  1. Heap Validation
    int mini_malloc_validate_heap(void) {
        // Traverse all arenas
        // Check pointer consistency
        // Verify metadata integrity
    }
    
  2. Statistics Tracking
    void mini_malloc_get_global_stats(...) {
        // Report allocation counts
        // Memory usage statistics
        // Fragmentation metrics
    }
    
  3. Debug Output
    mini_malloc_enable_debug(1);
    // Enables detailed operation logging
    

10. Real-World Considerations

Production Deployment

Suitable Use Cases:

  • Embedded systems with predictable workloads
  • Educational environments
  • Testing/debugging memory issues
  • Research prototype systems

Not Recommended For:

  • High-performance production servers
  • Applications with complex allocation patterns
  • Systems requiring advanced features (memory tagging, profiling)

Security Considerations

Current Vulnerabilities:

  • No guard pages or red zones
  • Metadata stored inline (corruption risk)
  • No randomization (predictable layout)
  • Limited overflow detection

Mitigation Strategies:

  • Add canary values in metadata (a sketch follows this list)
  • Implement guard pages for large allocations
  • Separate metadata storage
  • Add allocation limits
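
As an example of the first mitigation, a per-block canary can be verified on every free. The sketch below is only an illustration of the idea; the field and constant names are hypothetical and mini_malloc does not currently implement them:

#include <stdint.h>
#include <stdlib.h>

#define BLOCK_CANARY 0x5A5AA5A5u

typedef struct hardened_block_meta {
    size_t size;
    struct hardened_block_meta *next, *prev;
    uint32_t canary;                 // Written by malloc, checked by free
    uint8_t  free, use_mmap, arena_idx;
} hardened_block_meta_t;

// Called at the top of free(): abort instead of walking a corrupted list
static void check_canary(const hardened_block_meta_t *b) {
    if (b->canary != BLOCK_CANARY)
        abort();
}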

Platform-Specific Issues

Linux:

  • sbrk() deprecated in favor of mmap()
  • ASLR affects heap location
  • Overcommit settings affect mmap()

macOS:

  • sbrk() not fully supported
  • Different virtual memory layout
  • Requires mmap() for all allocations

Embedded:

  • May lack mmap() support
  • Fixed memory regions
  • No virtual memory

11. Comparison with Production Allocators

Feature Comparison

Feature          mini_malloc   ptmalloc2   jemalloc   tcmalloc
Lines of Code    ~800          ~5,000      ~30,000    ~15,000
Thread Caching   No            Limited     Yes        Yes
Size Classes     No            Yes         Yes        Yes
NUMA Aware       No            No          Yes        Limited
Profiling        Basic         No          Yes        Yes
Debug Support    Yes           Limited     Yes        Yes

Design Philosophy Differences

mini_malloc: Simplicity and education

  • Clear, readable code
  • Minimal features
  • Easy to modify

ptmalloc2: General purpose

  • Good all-around performance
  • Integrated with glibc
  • Complex but complete

jemalloc: Scalability focus

  • Excellent multi-threaded performance
  • Advanced profiling
  • Memory efficiency

tcmalloc: Speed focus

  • Thread-local caches
  • Low latency
  • Google-scale optimizations

12. Future Enhancements

Performance Improvements

  1. Segregated Free Lists
    // Bins for common sizes
    static block_meta_t *size_bins[32];  // 8, 16, 24, ... 256 (see the bin-index sketch after this list)
    
  2. Thread-Local Caching
    __thread cache_t local_cache = {
        .small_blocks = { NULL },
        .bytes_cached = 0
    };
    
  3. Better Size Classes
    • Reduce internal fragmentation
    • Optimize for common allocation sizes
    • Consider spacing (8, 16, 32, 48, 64…)
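
A sketch of how a request size would map to one of the bins from item 1 above (the 8-byte spacing up to 256 bytes is the same illustrative layout; the names are not part of mini_malloc):

#include <stddef.h>

#define NUM_BINS 32              // bins for 8, 16, 24, ..., 256 bytes
#define BIN_STEP 8
#define MAX_BINNED_SIZE (NUM_BINS * BIN_STEP)

// Map a request size to its bin index, or -1 for sizes that should fall
// back to the general-purpose free list
static int size_to_bin(size_t size) {
    if (size == 0 || size > MAX_BINNED_SIZE)
        return -1;
    return (int)((size + BIN_STEP - 1) / BIN_STEP) - 1;   // 1..8 -> 0, 9..16 -> 1, ...
}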

Feature Additions

  1. Memory Profiling
    • Allocation call stacks
    • Heap usage over time
    • Fragmentation analysis
  2. Debug Enhancements
    • Red zones for overflow detection
    • Allocation/free history
    • Leak detection at exit
  3. OS Integration
    • madvise() for unused pages
    • Huge page support
    • NUMA awareness

Algorithmic Improvements

  1. Coalescing Optimization
    • Deferred coalescing
    • Buddy system for power-of-2 sizes
    • Skip lists for O(log n) search
  2. Lock-Free Operations
    • CAS-based free list updates
    • Hazard pointers for safe traversal
    • Lock-free arenas for common sizes

13. Lessons Learned

Design Insights

  1. Simplicity Has Value
    • Clear code enables understanding
    • Simple algorithms often sufficient
    • Optimization can come later
  2. Concurrency Is Complex
    • Even simple locking has subtleties
    • Performance testing essential
    • Arena design effective for scaling
  3. Testing Is Crucial
    • Allocators have complex invariants
    • Concurrent testing finds real bugs
    • Debug features pay dividends

Implementation Challenges

  1. Metadata Corruption
    • Major source of bugs
    • Validation essential
    • Consider out-of-band storage
  2. Fragmentation Control
    • First-fit reasonable but not optimal
    • Coalescing policy matters
    • Size classes help significantly
  3. Platform Differences
    • sbrk() not portable
    • Page sizes vary
    • OS behavior differs

What Worked Well

  • Arena design for scalability
  • Separate mmap handling
  • Comprehensive test suite
  • Debug infrastructure

What Could Be Better

  • First-fit search performance
  • Metadata overhead
  • Cache behavior
  • Platform abstraction

14. Path to Production: What’s Missing and Why

While mini_malloc successfully demonstrates core allocator concepts, there’s a significant gap between our educational implementation and a production-ready memory allocator. Understanding this gap illuminates both the complexity of real-world systems and the engineering trade-offs that production allocators must navigate.

Current Limitations in Context

The most glaring limitation of mini_malloc is its O(n) search time for free blocks. In our current implementation, we traverse the entire free list to find a suitable block, which becomes increasingly problematic as memory fragmentation increases. A production system handling thousands of allocations per second would quickly bog down. This isn’t just a theoretical concern—real applications often maintain hundreds or thousands of live allocations, making linear search untenable.

Our fixed-size arena approach, while effective for reducing lock contention, creates another bottleneck. With only 8 arenas, applications running on modern 64-core systems will experience significant contention. Production allocators like jemalloc dynamically scale their arena count based on CPU detection, sometimes creating multiple arenas per core to handle different size classes. This dynamic scaling is crucial for cloud environments where applications might run on vastly different hardware configurations.

The metadata overhead in mini_malloc, roughly 32 bytes per allocation, represents another critical issue. For small allocations (say, 16-byte strings), we’re imposing a 200% overhead. Production allocators address this through sophisticated size class systems, where small objects share metadata across multiple allocations. tcmalloc, for instance, groups small objects into “spans” with shared headers, reducing per-object overhead to just a few bits.

The Size Class Revolution

Perhaps the most transformative optimization missing from mini_malloc is a proper size class system. Production allocators don’t treat all allocations equally—they recognize that certain sizes appear far more frequently than others. By pre-creating pools for common sizes (8, 16, 32, 64 bytes, etc.), allocators can satisfy most requests in O(1) time without any search at all.

Implementing size classes would fundamentally restructure mini_malloc. Instead of a single free list per arena, we’d maintain separate lists for each size class. Small allocations would bypass the general-purpose allocator entirely, drawing from pre-sized pools. This isn’t just about speed—it virtually eliminates internal fragmentation for small objects, as every object in a size class uses exactly the same amount of space.

The challenge lies in choosing the right size classes. Too many classes waste memory on partially-filled pools; too few force poorly-fitting allocations into oversized slots. Production allocators often use a mathematical progression (like jemalloc’s size classes that grow by approximately 20-25% each step) to balance these concerns. They also implement sophisticated heuristics to decide when to create new spans versus reusing existing ones.
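
To make that progression concrete, here is a tiny generator for geometrically spaced classes. The ~25% growth factor mirrors the jemalloc-style spacing described above, but the exact table it prints is illustrative rather than any allocator’s real class list:

#include <stdio.h>
#include <stddef.h>

int main(void) {
    size_t cls = 8;
    while (cls <= 4096) {
        printf("%zu\n", cls);
        size_t next = cls + cls / 4;          // grow each class by roughly 25%
        next = (next + 7) & ~(size_t)7;       // keep every class 8-byte aligned
        cls = (next > cls) ? next : cls + 8;  // guarantee forward progress
    }
    return 0;
}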

Thread-Local Caching

Modern production allocators achieve their impressive performance primarily through thread-local caching. Each thread maintains a private cache of recently freed objects, eliminating synchronization overhead for the common case of allocation and deallocation within the same thread. Only when the local cache overflows or empties does the thread interact with the global arena.

Adding thread-local caching to mini_malloc would require careful consideration of cache size limits, overflow policies, and memory migration between threads. Too large a cache wastes memory; too small forces frequent synchronization. Production allocators typically implement adaptive algorithms that adjust cache sizes based on thread behavior, growing caches for allocation-heavy threads while shrinking them for inactive ones.
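
A first cut at such a cache could be as simple as a small per-thread stack of freed pointers. The sketch below ignores size classes and overflow handling entirely; every name in it is illustrative:

#include <stddef.h>

#define TCACHE_CAPACITY 64            // blocks held per thread before falling back

typedef struct thread_cache {
    void   *blocks[TCACHE_CAPACITY];  // recently freed user pointers
    size_t  count;                    // occupied slots
    size_t  bytes_cached;             // rough accounting for future limits
} thread_cache_t;

static __thread thread_cache_t tcache;  // one private instance per thread

// Try to keep a freed block locally; return 0 when the cache is full and
// the caller must take the normal locked arena path instead
static int tcache_put(void *ptr, size_t size) {
    if (tcache.count == TCACHE_CAPACITY)
        return 0;
    tcache.blocks[tcache.count++] = ptr;
    tcache.bytes_cached += size;
    return 1;
}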

Memory Return Policies

Currently, mini_malloc never returns sbrk-allocated memory to the operating system. Once the heap grows, it remains expanded until process termination. This works for educational purposes but fails in production scenarios where applications experience varying memory pressure over time. A web server handling occasional traffic spikes would permanently retain its peak memory usage.

Production allocators implement sophisticated memory return policies. They track which pages haven’t been accessed recently and use madvise(MADV_DONTNEED) to tell the kernel it can reclaim them. Some allocators go further, implementing background threads that periodically scan for returnable memory. The challenge is balancing memory efficiency against the cost of system calls and potential page faults when the memory is needed again.
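
The core mechanism is small, even if the surrounding policy is not. A sketch of releasing the pages fully contained in a large free block (the helper and its call sites are illustrative, not part of mini_malloc):

#include <sys/mman.h>
#include <stdint.h>
#include <unistd.h>

// Hint that the pages wholly inside [start, start + len) may be reclaimed;
// anonymous pages read back as zeroes if they are ever touched again
static void release_unused_pages(void *start, size_t len) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    uintptr_t lo = ((uintptr_t)start + page - 1) & ~(uintptr_t)(page - 1);  // round up
    uintptr_t hi = ((uintptr_t)start + len) & ~(uintptr_t)(page - 1);       // round down

    if (hi > lo)
        madvise((void *)lo, hi - lo, MADV_DONTNEED);
}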

Debugging and Profiling Infrastructure

While mini_malloc includes basic debugging support, production environments demand far more sophisticated tooling. Memory profiling must track not just how much memory is allocated, but where it was allocated from, how long it lived, and what access patterns it exhibited. This information proves invaluable for optimizing applications and tracking down memory leaks.

Production allocators often implement sampling-based profiling to reduce overhead. Rather than tracking every allocation, they randomly sample a small percentage and extrapolate from there. They also provide APIs for applications to mark allocation sites, query heap statistics, and even trigger garbage collection-like heap compaction operations.

Security Hardening

In today’s security-conscious environment, allocators must defend against heap-based attacks. mini_malloc’s predictable layout and inline metadata make it vulnerable to numerous exploits. An attacker who corrupts a block’s metadata can potentially execute arbitrary code when the allocator processes that block.

Production allocators implement multiple security measures: randomizing allocation patterns to prevent attackers from predicting memory layouts, separating metadata from user data to prevent corruption, adding guard pages around sensitive allocations, and validating all metadata before use. Some allocators even encrypt metadata or use hardware features like Intel MPX for bounds checking.

Platform-Specific Optimizations

Our POSIX-compliant implementation sacrifices significant performance by ignoring platform-specific features. Modern processors provide hardware transactional memory, large page support, and NUMA-aware allocation primitives. Production allocators detect and utilize these features when available, falling back to portable implementations only when necessary.

For example, Linux’s transparent huge pages can dramatically improve performance for large allocations, but require careful alignment and hinting through madvise. NUMA systems benefit from allocating memory on the same node as the requesting CPU, requiring allocators to track CPU affinity and memory topology.

The Path Forward

Transforming mini_malloc into a production-ready allocator would require addressing each of these limitations systematically. The journey would likely take this path:

First, implement basic size classes for common allocation sizes. This immediately improves performance for the majority of allocations while keeping the design relatively simple. The existing arena structure could be extended with per-size-class free lists.

Next, add thread-local caching with a simple fixed-size cache per thread. Even a basic implementation dramatically reduces lock contention. Start with a static cache size and add adaptation later.

Then, implement memory return policies using madvise for unused pages. Begin with a simple timer-based approach before adding sophisticated usage tracking.

Following that, add comprehensive profiling infrastructure with sampling support. This provides the visibility needed to optimize further.

Finally, implement security hardening measures appropriate for the target environment. Not every allocator needs every security feature, but metadata validation should be considered essential.

Why These Features Matter

Each missing feature addresses a specific real-world concern. Size classes handle the reality that applications allocate many objects of similar sizes. Thread-local caching recognizes that modern applications are inherently multi-threaded. Memory return policies acknowledge that long-running applications must adapt to changing workloads. Profiling supports the iterative optimization that production software requires. Security features protect against the unfortunate reality of malicious actors.

The gap between mini_malloc and production allocators isn’t just about performance—it’s about meeting the diverse, sometimes conflicting requirements of real-world software. Production allocators must be fast but also safe, memory-efficient but also CPU-efficient, general-purpose but also optimized for common cases. They must work well on hardware ranging from embedded systems to massive NUMA servers, support applications from simple utilities to complex databases, and provide the observability that modern operations teams require.

Understanding these requirements and the engineering solutions that address them transforms memory allocation from a simple systems programming exercise into a fascinating study of practical computer science. The journey from mini_malloc to a production allocator illuminates not just how memory allocators work, but why they work the way they do.

15. Conclusion

The mini_malloc project succeeds in its primary goal: demonstrating that complex systems can be understood through incremental implementation. Starting with basic sbrk calls and building up to a thread-safe allocator with debugging support, we’ve created something that would have seemed impossibly complex at the outset.

More importantly, by understanding what mini_malloc lacks compared to production allocators, we gain insight into the real challenges of systems programming. The gap between educational and production code isn’t just about optimization—it’s about addressing the messy realities of hardware diversity, security threats, and operational requirements.

Whether you’re using mini_malloc as a learning tool, a starting point for research, or a lightweight allocator for embedded systems, remember that its simplicity is both its greatest strength and its fundamental limitation. In the world of memory allocators, as in much of systems programming, there are no perfect solutions—only thoughtful trade-offs for specific use cases.

The journey from basic memory management to production-ready allocation strategies illustrates a fundamental truth about software engineering: understanding why systems are built a certain way is just as important as understanding how they work. Through mini_malloc, we’ve explored both dimensions, creating not just a functional allocator but a foundation for deeper understanding of systems programming.


References and Further Reading

  1. Classic Papers
    • “Dynamic Storage Allocation: A Survey and Critical Review” - Wilson et al.
    • “The Memory Fragmentation Problem: Solved?” - Johnstone & Wilson
    • “Hoard: A Scalable Memory Allocator” - Berger et al.
  2. Implementation References
    • Doug Lea’s malloc (dlmalloc)
    • GNU libc malloc (ptmalloc2)
    • jemalloc technical documentation
    • tcmalloc design document
  3. Books
    • “The Art of Computer Programming, Vol. 1” - Knuth (Section 2.5)
    • “Advanced Programming in the UNIX Environment” - Stevens
    • “Computer Systems: A Programmer’s Perspective” - Bryant & O’Hallaron
  4. Online Resources
    • Memory Management Reference (www.memorymanagement.org)
    • Malloc Tutorial (danluu.com)
    • Allocator Benchmarks (github.com/emeryberger/)

Appendix A: Comprehensive Test Suite

Below you’ll find the complete source for mini_malloc.h, mini_malloc.c, and the test_mini_malloc.c harness, covering every aspect of functionality—from basic correctness checks to stress testing and white-box introspection.

mini_malloc.h

/*
 * mini_malloc.h - Header file for mini_malloc memory allocator
 *
 * This header provides the public interface for the mini_malloc library
 * and optionally exposes internal functions for testing purposes.
 */

#ifndef MINI_MALLOC_H
#define MINI_MALLOC_H

#include <stddef.h>
#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

/* =============================================================================
 * PUBLIC API - Standard malloc interface
 * =============================================================================
 */

/**
 * Allocate memory block of specified size
 * @param size Size in bytes to allocate
 * @return Pointer to allocated memory, or NULL on failure
 */
void *malloc(size_t size);

/**
 * Free previously allocated memory block
 * @param ptr Pointer to memory block to free (NULL is safe)
 */
void free(void *ptr);

/**
 * Reallocate memory block to new size
 * @param ptr Pointer to existing block (NULL acts like malloc)
 * @param size New size in bytes (0 acts like free)
 * @return Pointer to reallocated memory, or NULL on failure
 */
void *realloc(void *ptr, size_t size);

/**
 * Allocate and zero-initialize array
 * @param nmemb Number of elements
 * @param size Size of each element
 * @return Pointer to zeroed memory, or NULL on failure/overflow
 */
void *calloc(size_t nmemb, size_t size);

/* =============================================================================
 * TESTING/DEBUG API - Only enabled when MINI_MALLOC_TESTING is defined
 * =============================================================================
 */

#ifdef MINI_MALLOC_TESTING

/* Constants exposed for testing */
#define MINI_MALLOC_ALIGN8(x) (((x) + 7ULL) & ~7ULL)
#define MINI_MALLOC_MMAP_THRESHOLD (64 * 1024)
#define MINI_MALLOC_NUM_ARENAS 8

/* Forward declarations for internal structures */
struct block_meta;
typedef struct arena arena_t;

/* Block metadata structure - exposed for testing */
typedef struct block_meta {
    size_t size;               /* payload size in bytes             */
    struct block_meta *next;   /* next block in arena list          */
    struct block_meta *prev;   /* prev block in arena list          */
    uint8_t free;              /* 1 = free, 0 = in-use              */
    uint8_t use_mmap;          /* 1 = this block was created by mmap */
    uint8_t arena_idx;         /* owning arena index                 */
} block_meta_t;

#define MINI_MALLOC_META_SIZE (sizeof(block_meta_t))

/* Testing utility functions */

/**
 * Get block metadata from user pointer
 * @param ptr User pointer returned by malloc
 * @return Pointer to block metadata, or NULL if ptr is NULL
 */
static inline block_meta_t *mini_malloc_get_block_meta(void *ptr) {
    return ptr ? ((block_meta_t *)ptr - 1) : NULL;
}

/**
 * Check if pointer is properly aligned
 * @param ptr Pointer to check
 * @return 1 if aligned to 8 bytes, 0 otherwise
 */
static inline int mini_malloc_is_aligned(void *ptr) {
    return ((uintptr_t)ptr & 7) == 0;
}

/**
 * Enable or disable debug output
 * @param enable 1 to enable debug output, 0 to disable
 */
void mini_malloc_enable_debug(int enable);

/**
 * Get arena index for current thread
 * @return Arena index (0 to NUM_ARENAS-1)
 */
uint8_t mini_malloc_get_current_arena_idx(void);

/**
 * Get statistics for specific arena
 * @param arena_idx Arena index
 * @param total_blocks Output: total number of blocks
 * @param free_blocks Output: number of free blocks
 * @param total_size Output: total allocated size
 * @param free_size Output: total free size
 * @return 0 on success, -1 on invalid arena_idx
 */
int mini_malloc_get_arena_stats(uint8_t arena_idx, 
                                size_t *total_blocks,
                                size_t *free_blocks, 
                                size_t *total_size,
                                size_t *free_size);

/**
 * Get global statistics across all arenas
 * @param total_blocks Output: total number of blocks
 * @param free_blocks Output: number of free blocks
 * @param total_size Output: total allocated size
 * @param free_size Output: total free size
 * @param mmap_blocks Output: number of mmap blocks
 * @param mmap_size Output: total mmap size
 */
void mini_malloc_get_global_stats(size_t *total_blocks,
                                  size_t *free_blocks,
                                  size_t *total_size,
                                  size_t *free_size,
                                  size_t *mmap_blocks,
                                  size_t *mmap_size);

/**
 * Validate heap consistency for debugging
 * @return 1 if heap is consistent, 0 if corruption detected
 */
int mini_malloc_validate_heap(void);

/**
 * Print debugging information about all arenas
 */
void mini_malloc_debug_print_arenas(void);

/**
 * Force coalescing in all arenas (for testing fragmentation)
 */
void mini_malloc_force_coalesce_all(void);

#endif /* MINI_MALLOC_TESTING */

#ifdef __cplusplus
}
#endif

#endif /* MINI_MALLOC_H */

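To make the testing API above concrete, here is a minimal sketch of how a host program might inspect a live block. It assumes both files are compiled with -DMINI_MALLOC_TESTING (as the test harness below does); the file name example_inspect.c is illustrative only and is not part of the project.

/* example_inspect.c - illustrative only; build together with mini_malloc.c:
 *   gcc -DMINI_MALLOC_TESTING example_inspect.c mini_malloc.c -lpthread
 */
#include "mini_malloc.h"
#include <stdio.h>

int main(void) {
    void *p = malloc(100);
    if (!p) return 1;

    /* Peek at the inline metadata that sits just before the payload */
    block_meta_t *meta = mini_malloc_get_block_meta(p);
    printf("payload=%zu bytes, free=%u, mmap=%u, arena=%u\n",
           meta->size, (unsigned)meta->free,
           (unsigned)meta->use_mmap, (unsigned)meta->arena_idx);

    printf("8-byte aligned: %d, heap consistent: %d\n",
           mini_malloc_is_aligned(p), mini_malloc_validate_heap());

    free(p);
    return 0;
}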

mini_malloc.c

/*
 * mini_malloc.c — A small first‐fit memory allocator
 *
 * DESIGN PHILOSOPHY:
 * This allocator balances simplicity with real-world functionality. Rather than
 * optimizing for every possible use case, we focus on clear, understandable code
 * that demonstrates fundamental allocator concepts while remaining practical for
 * educational and embedded use cases.
 *
 * KEY DESIGN DECISIONS:
 * 1. Arena-based concurrency: Reduces lock contention without complex lock-free algorithms
 * 2. First-fit allocation: Simple and effective for most workloads
 * 3. Immediate coalescing: Prevents long-term fragmentation at small runtime cost
 * 4. Dual allocation strategy: sbrk for small, mmap for large allocations
 * 5. Inline metadata: Simplifies implementation at cost of corruption vulnerability
 *
 * WHAT THIS ALLOCATOR IS:
 * - A learning tool for understanding memory allocation
 * - A lightweight allocator for embedded systems
 * - A foundation for experimentation and enhancement
 *
 * WHAT THIS ALLOCATOR IS NOT:
 * - A high-performance production allocator
 * - A security-hardened implementation
 * - A feature-complete malloc replacement
 */

#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <pthread.h>
#include <errno.h>
#include <stdio.h>
#include <assert.h>

#include "mini_malloc.h"

/* ============================= CONSTANTS ================================= */

/*
 * ALIGN8: Ensures all allocations are 8-byte aligned
 * 
 * WHY 8 BYTES?
 * - Satisfies alignment requirements for all basic C types (including double)
 * - Optimal for 64-bit architectures where pointers are 8 bytes
 * - Prevents unaligned access penalties on most CPUs
 * - Industry standard (most allocators use 8 or 16 byte alignment)
 *
 * The bit manipulation (x + 7) & ~7 efficiently rounds up to next multiple of 8:
 * - Add 7 to ensure we round up, not down
 * - Mask off lower 3 bits to align to 8-byte boundary
 */
#define ALIGN8(x) (((x) + 7ULL) & ~7ULL)
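/* Worked examples: ALIGN8(1) == 8, ALIGN8(8) == 8, ALIGN8(13) == 16. */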

/*
 * MMAP_THRESHOLD: Boundary between heap (sbrk) and mmap allocation
 * 
 * WHY 64KB?
 * - Large enough to amortize mmap overhead (system call cost)
 * - Small enough to prevent heap fragmentation from large allocations
 * - Represents 16 pages on typical 4KB page systems
 * - Empirically good balance for mixed workloads
 *
 * TRADE-OFFS:
 * - Lower threshold = more mmap calls but less heap fragmentation
 * - Higher threshold = fewer system calls but more potential fragmentation
 */
#define MMAP_THRESHOLD (64 * 1024)

/*
 * NUM_ARENAS: Number of independent allocation arenas
 * 
 * WHY 8 ARENAS?
 * - Powers of 2 make modulo operation efficient (compiler optimization)
 * - Reasonable balance for typical multi-core systems (4-8 cores)
 * - Small enough to avoid excessive memory overhead
 * - Large enough to reduce contention for most workloads
 *
 * Production allocators often scale this with CPU count, but fixed size
 * keeps our implementation simple and predictable.
 */
#define NUM_ARENAS 8

/* ========================== DATA STRUCTURES ============================== */

#ifdef MINI_MALLOC_TESTING
typedef block_meta_t block_meta;
#else
/*
 * block_meta: Metadata header for each allocation
 * 
 * DESIGN RATIONALE:
 * We store metadata inline, immediately before user data. This is simple
 * but has security implications (buffer overflows can corrupt metadata).
 *
 * FIELD EXPLANATIONS:
 * - size: User-requested size (not including metadata)
 * - next/prev: Doubly-linked list enables O(1) coalescing
 * - free: Explicit flag simplifies debugging and validation
 * - use_mmap: Different deallocation path for mmap'd blocks
 * - arena_idx: Allows block to find its arena during free()
 *
 * SIZE CONSIDERATIONS:
 * On 64-bit systems: 8 + 8 + 8 + 1 + 1 + 1 + 5 padding = 32 bytes
 * This overhead is significant for small allocations, which is why
 * production allocators use size classes and slab allocation.
 */
typedef struct block_meta {
    size_t size;               /* payload size in bytes */
    struct block_meta *next;   /* next block in arena list */
    struct block_meta *prev;   /* prev block in arena list */
    uint8_t free;              /* 1 = free, 0 = in-use */
    uint8_t use_mmap;          /* 1 = allocated via mmap */
    uint8_t arena_idx;         /* owning arena index */
} block_meta;
#endif

/*
 * arena_t: Per-arena state
 * 
 * DESIGN PHILOSOPHY:
 * Each arena is completely independent, eliminating cross-arena synchronization.
 * This is a key scalability feature - threads working in different arenas
 * never contend for the same lock.
 *
 * SIMPLIFICATIONS:
 * - No per-arena statistics (would add overhead)
 * - No adaptive sizing (arena count is fixed)
 * - No NUMA awareness (all arenas are equal)
 */
typedef struct arena {
    pthread_mutex_t lock;  /* Protects all operations on this arena */
    block_meta *base;      /* Head of block list (first block in arena) */
} arena_t;

/*
 * Global arena array
 * 
 * INITIALIZATION:
 * The range designated initializer [0 ... N-1] is a GCC/Clang extension; it
 * statically initializes every mutex, avoiding race conditions on first use.
 */
static arena_t arenas[NUM_ARENAS] = {
    [0 ... NUM_ARENAS - 1] = { PTHREAD_MUTEX_INITIALIZER, NULL }
};

/*
 * Global mmap tracking
 * 
 * WHY SEPARATE TRACKING?
 * mmap'd blocks aren't part of the heap and can't be coalesced with heap blocks.
 * We need a separate list to find them during free() since they're not in any
 * arena's block list.
 *
 * This could be eliminated by adding mmap blocks to arena lists, but that would
 * complicate coalescing logic and violate the principle that arenas manage
 * contiguous heap memory.
 */
static block_meta *mmap_blocks_head = NULL;
static pthread_mutex_t mmap_blocks_lock = PTHREAD_MUTEX_INITIALIZER;

#define META_SIZE (sizeof(block_meta))
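/*
 * Note: the user pointer handed out is (block + 1), so sizeof(block_meta)
 * must itself be a multiple of 8 for the 8-byte alignment guarantee to hold.
 * With the field layout above it is 32 bytes on typical LP64 targets; a C11
 * _Static_assert on sizeof(block_meta) % 8 == 0 would make this explicit.
 */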

/* Debug support - allows runtime enable/disable of debug output */
static int debug_enabled = 0;

#define DEBUG_PRINT(fmt, ...) \
    do { if (debug_enabled) fprintf(stderr, "[DEBUG] " fmt "\n", ##__VA_ARGS__); } while(0)

/* ========================== HELPER FUNCTIONS ============================= */

/*
 * get_arena: Maps current thread to an arena
 * 
 * THREAD DISTRIBUTION STRATEGY:
 * We use simple modulo hashing of the thread ID. This is cheap, but note that
 * on implementations where pthread_t is a pointer to an aligned thread
 * descriptor, the low bits carry little entropy and threads may cluster onto
 * a few arenas; mixing in higher bits before the modulo would spread them
 * more evenly.
 * 
 * ALTERNATIVES CONSIDERED:
 * - True random selection (adds overhead, not deterministic)
 * - Thread-local storage (more complex, platform-specific)
 * - Work-stealing queues (too complex for our goals)
 */
static inline arena_t *get_arena(void) {
    uintptr_t tid = (uintptr_t)pthread_self();
    uint8_t idx = tid % NUM_ARENAS;
    DEBUG_PRINT("Thread %lu using arena %d", tid, idx);
    return &arenas[idx];
}

/*
 * validate_block: Sanity check for block metadata
 * 
 * DEFENSIVE PROGRAMMING:
 * Memory corruption often manifests as corrupted pointers or size fields.
 * This function catches common corruption patterns:
 * - NULL pointers where they shouldn't be
 * - Zero sizes (every block has some size)
 * - Invalid arena indices
 * - Obviously bad pointers (very low addresses, magic debug values)
 *
 * This won't catch all corruption but helps enormously during debugging.
 */
static int validate_block(block_meta *b, const char *context) {
    if (!b) {
        DEBUG_PRINT("NULL block in %s", context);
        return 0;
    }
    
    if (b->size == 0) {
        DEBUG_PRINT("Zero size block in %s", context);
        return 0;
    }
    
    if (b->arena_idx >= NUM_ARENAS) {
        DEBUG_PRINT("Invalid arena_idx %d in %s", b->arena_idx, context);
        return 0;
    }
    
    /* 
     * Common corruption patterns:
     * - Pointers in first page (0x0-0xFFF) indicate NULL + offset bugs
     * - 0xDEADBEEF is a common debug marker for freed memory
     * - Add more patterns as discovered during testing
     */
    if (b->next && ((uintptr_t)b->next < 0x1000 || (uintptr_t)b->next == 0xdeadbeef)) {
        DEBUG_PRINT("Suspicious next pointer %p in %s", b->next, context);
        return 0;
    }
    
    if (b->prev && ((uintptr_t)b->prev < 0x1000 || (uintptr_t)b->prev == 0xdeadbeef)) {
        DEBUG_PRINT("Suspicious prev pointer %p in %s", b->prev, context);
        return 0;
    }
    
    return 1;
}

/* ==================== BLOCK MANAGEMENT FUNCTIONS ========================= */

/*
 * split_block: Divide a block if it's significantly larger than needed
 * 
 * SPLITTING STRATEGY:
 * We only split if the remainder would be at least META_SIZE + 8 bytes.
 * This prevents creating tiny, unusable fragments.
 *
 * WHY 8 BYTES MINIMUM?
 * - Smallest useful allocation on 64-bit systems (one pointer)
 * - Prevents pathological fragmentation from 1-byte remainders
 * - Maintains alignment guarantees
 *
 * IMPLEMENTATION NOTES:
 * - The new block is created immediately after the requested space
 * - Pointer arithmetic must account for both metadata and user data
 * - All list pointers are updated to maintain doubly-linked invariants
 */
static void split_block(block_meta *b, size_t req) {
    if (!validate_block(b, "split_block")) return;
    
    /* Don't split if remainder would be too small */
    if (b->size < req + META_SIZE + 8) return;
    
    /* Calculate where the new block's metadata begins */
    block_meta *n = (block_meta *)((char *)(b + 1) + req);
    
    /* Initialize the new remainder block */
    n->size      = b->size - req - META_SIZE;
    n->next      = b->next;
    n->prev      = b;
    n->free      = 1;  /* Remainder is free */
    n->use_mmap  = 0;  /* Only heap blocks are split */
    n->arena_idx = b->arena_idx;

    /* Update list pointers to insert new block */
    if (n->next) {
        if (!validate_block(n->next, "split_block n->next")) return;
        n->next->prev = n;
    }
    b->next = n;
    b->size = req;  /* Original block now exactly requested size */
    
    DEBUG_PRINT("Split block: original=%zu, requested=%zu, remainder=%zu", 
                b->size + META_SIZE + n->size, req, n->size);
}
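
/*
 * Worked example (assuming META_SIZE = 32, as on typical LP64 builds):
 * splitting a 64-byte free block for a 24-byte request leaves
 *     n->size = 64 - 24 - 32 = 8
 * bytes in the remainder, exactly the minimum the guard above allows
 * (64 >= 24 + 32 + 8). A 56-byte block would not be split for the same
 * request, because the remainder would have no usable payload.
 */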

/*
 * coalesce: Merge adjacent free blocks to reduce fragmentation
 * 
 * COALESCING STRATEGY:
 * We use immediate coalescing - blocks are merged as soon as they're freed.
 * This has trade-offs:
 * 
 * ADVANTAGES:
 * - Heap remains maximally consolidated
 * - Better chance of satisfying large allocations
 * - Simpler than deferred coalescing
 *
 * DISADVANTAGES:
 * - Extra work during free() even if block will be reallocated soon
 * - Can't coalesce with blocks that will be freed in the near future
 *
 * IMPLEMENTATION:
 * We check both directions (next and prev) because we maintain a doubly-linked
 * list. The order matters - we do forward coalescing first so that backward
 * coalescing potentially merges an even larger block.
 */
static void coalesce(block_meta *b) {
    if (!validate_block(b, "coalesce")) return;
    
    /*
     * Forward coalescing: merge with next block if it's free
     * 
     * Note: We can't coalesce mmap blocks because they're not contiguous
     * with heap blocks (different memory regions entirely).
     */
    if (b->next && b->next->free && !b->next->use_mmap) {
        if (!validate_block(b->next, "coalesce next")) return;
        
        DEBUG_PRINT("Coalescing with next block: %zu + %zu", b->size, b->next->size);
        
        /* Absorb the next block's space, including its metadata */
        b->size += META_SIZE + b->next->size;
        b->next  = b->next->next;
        if (b->next) {
            if (!validate_block(b->next, "coalesce next->next")) return;
            b->next->prev = b;
        }
    }
    
    /*
     * Backward coalescing: merge with previous block if it's free
     * 
     * This is trickier because we're essentially removing ourselves from
     * the list and expanding the previous block to include us.
     */
    if (b->prev && b->prev->free && !b->prev->use_mmap) {
        if (!validate_block(b->prev, "coalesce prev")) return;
        
        DEBUG_PRINT("Coalescing with prev block: %zu + %zu", b->prev->size, b->size);
        
        /* Previous block absorbs our space */
        b->prev->size += META_SIZE + b->size;
        b->prev->next  = b->next;
        if (b->next) {
            if (!validate_block(b->next, "coalesce prev merge")) return;
            b->next->prev = b->prev;
        }
        /* Note: block b is now invalid - it's been absorbed */
    }
}
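
/*
 * Worked example (assuming META_SIZE = 32): three adjacent 64-byte heap
 * blocks A, B, C with A and C already free. Freeing B runs coalesce(B): the
 * forward pass absorbs C (B's payload becomes 64 + 32 + 64 = 160 bytes), then
 * the backward pass lets A absorb B, leaving one free block with a
 * 64 + 32 + 160 = 256-byte payload spanning all three original blocks.
 */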

/* ==================== ALLOCATION FUNCTIONS =============================== */

/*
 * find_free_block: Search arena for a suitable free block
 * 
 * ALGORITHM: First-fit
 * We take the first block that's large enough. This is simple and works well
 * for many workloads, though it can lead to fragmentation over time.
 *
 * WHY NOT BEST-FIT?
 * - Best-fit requires scanning entire list (always O(n))
 * - Studies show first-fit often performs comparably
 * - First-fit can return immediately upon finding suitable block
 *
 * The 'last' parameter is an optimization - if we don't find a suitable block,
 * we know where to append new memory without rescanning.
 */
static block_meta *find_free_block(block_meta **last, size_t size, arena_t *a) {
    block_meta *b = a->base;
    *last = NULL;
    
    while (b) {
        if (!validate_block(b, "find_free_block")) {
            DEBUG_PRINT("Invalid block found in arena, stopping search");
            break;
        }
        
        if (b->free && b->size >= size) {
            DEBUG_PRINT("Found free block: size=%zu, requested=%zu", b->size, size);
            return b;
        }
        *last = b;
        b = b->next;
    }
    return NULL;
}

/*
 * request_space: Allocate new memory from the OS
 * 
 * DUAL STRATEGY:
 * Small allocations use sbrk() to extend the heap, while large allocations
 * use mmap() to get independent memory regions.
 *
 * WHY TWO STRATEGIES?
 * - sbrk is fast but can only grow (not shrink) the heap
 * - mmap is slower but memory can be returned to OS immediately
 * - Large allocations via mmap don't fragment the heap
 * - Different workloads benefit from different strategies
 */
static block_meta *request_space(block_meta *last, size_t size, arena_t *a, uint8_t arena_idx) {
    DEBUG_PRINT("Requesting space: size=%zu, arena=%d", size, arena_idx);
    
    /*
     * LARGE ALLOCATION PATH (mmap)
     * 
     * For allocations >= 64KB, we bypass the heap entirely and get memory
     * directly from the OS via mmap. This has several advantages:
     * - Doesn't fragment the heap with large blocks
     * - Memory returned immediately to OS on free
     * - Can allocate very large blocks (limited only by virtual memory)
     */
    if (size >= MMAP_THRESHOLD) {
        DEBUG_PRINT("Using mmap for large allocation");
        
        /* Round up to page size for efficiency */
        size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
        size_t total  = ALIGN8(size + META_SIZE);
        total = ((total + pagesz - 1) / pagesz) * pagesz;
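        /*
         * Worked example with 4 KiB pages (assuming META_SIZE = 32): a
         * 128 KiB request gives ALIGN8(131072 + 32) = 131104 bytes, rounded
         * up to 33 pages (135168 bytes), so the payload ends up 135136 bytes.
         */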
        
        DEBUG_PRINT("mmap: pagesz=%zu, total=%zu", pagesz, total);
        
        /*
         * mmap FLAGS EXPLAINED:
         * - PROT_READ | PROT_WRITE: Memory is readable and writable
         * - MAP_PRIVATE: Changes don't affect other processes
         * - MAP_ANONYMOUS: Not backed by a file (just RAM)
         * - fd=-1, offset=0: Required for anonymous mappings
         */
        void *mem = mmap(NULL, total, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED) {
            DEBUG_PRINT("mmap failed");
            return NULL;
        }
        
        /* Initialize block metadata at start of mmap'd region */
        block_meta *b = (block_meta *)mem;
        b->size      = total - META_SIZE;
        b->next      = NULL;  /* mmap blocks are standalone */
        b->prev      = NULL;
        b->free      = 0;
        b->use_mmap  = 1;     /* Critical: marks for munmap on free */
        b->arena_idx = arena_idx;
        
        DEBUG_PRINT("mmap block created: addr=%p, size=%zu", (void*)(b+1), b->size);
        
        /*
         * Add to global mmap list for tracking
         * This is necessary because mmap blocks aren't in arena lists,
         * but we need to find them during free() operations.
         */
        pthread_mutex_lock(&mmap_blocks_lock);
        if (mmap_blocks_head) {
            mmap_blocks_head->prev = b;
        }
        b->next = mmap_blocks_head;
        b->prev = NULL;
        mmap_blocks_head = b;
        pthread_mutex_unlock(&mmap_blocks_lock);
        
        return b;
    }

    /*
     * SMALL ALLOCATION PATH (sbrk)
     * 
     * For smaller allocations, we extend the process heap using sbrk().
     * This is fast but irreversible - the heap only grows.
     *
     * HISTORICAL NOTE:
     * sbrk() is deprecated on some systems in favor of mmap(), but we use
     * it here for educational purposes and because it clearly shows the
     * heap growth concept.
     */
    DEBUG_PRINT("Using sbrk for allocation");
    
    void *mem = sbrk(size + META_SIZE);
    if (mem == (void *)-1) {
        DEBUG_PRINT("sbrk failed");
        return NULL;
    }
    
    /* Initialize the new block */
    block_meta *b = (block_meta *)mem;
    b->size      = size;
    b->next      = NULL;
    b->prev      = last;
    b->free      = 0;
    b->use_mmap  = 0;
    b->arena_idx = arena_idx;

    /* Link into arena's block list */
    if (last) {
        if (!validate_block(last, "request_space last")) {
            DEBUG_PRINT("Invalid last block");
            return NULL;
        }
        last->next = b;
    } else {
        /* First block in this arena */
        a->base = b;
    }

    DEBUG_PRINT("Created new sbrk block: addr=%p, size=%zu", (void*)(b+1), b->size);
    return b;
}

/* ========================= PUBLIC API ==================================== */

/*
 * malloc: Allocate memory
 * 
 * IMPLEMENTATION OVERVIEW:
 * 1. Round size up to alignment boundary
 * 2. Select arena based on thread ID
 * 3. For large allocations, bypass arena and use mmap
 * 4. For small allocations:
 *    a. Lock arena
 *    b. Search for suitable free block
 *    c. If found, split if needed and return
 *    d. If not found, request new space from OS
 *    e. Unlock arena
 *
 * CONCURRENCY NOTES:
 * - Large allocations never take an arena lock (mmap itself is thread-safe)
 * - The arena lock is held only for the list search, split, and heap growth
 * - sbrk for new heap space runs under the arena lock, which serializes heap
 *   growth within an arena but keeps the block list consistent
 */
void *malloc(size_t size) {
    /* Edge case: for malloc(0) the C standard allows either NULL or a unique
     * pointer; this implementation returns NULL. */
    if (size == 0) return NULL;
    
    /* Ensure alignment for all allocations */
    size = ALIGN8(size);

    DEBUG_PRINT("malloc(%zu) after alignment", size);

    /* Select arena for this thread */
    arena_t *a = get_arena();
    uint8_t idx = (uint8_t)(a - arenas);
    
    DEBUG_PRINT("malloc(%zu) in arena %d", size, idx);
    
    /*
     * OPTIMIZATION: Large allocations bypass arena entirely
     * 
     * Since large blocks use mmap and aren't part of the arena's free list,
     * there's no point in searching the free list or acquiring the arena lock.
     * This significantly reduces contention for applications that mix small
     * and large allocations.
     */
    if (size >= MMAP_THRESHOLD) {
        DEBUG_PRINT("Large allocation (%zu >= %d), skipping arena", size, MMAP_THRESHOLD);
        block_meta *b = request_space(NULL, size, a, idx);
        if (!b) {
            DEBUG_PRINT("malloc failed to get mmap space");
            return NULL;
        }
        DEBUG_PRINT("malloc returning new mmap block: %p", (void*)(b+1));
        return (void *)(b + 1);
    }
    
    /* Small allocation path - need arena lock */
    int lock_result = pthread_mutex_lock(&a->lock);
    if (lock_result != 0) {
        DEBUG_PRINT("Failed to lock arena %d: %d", idx, lock_result);
        return NULL;
    }

    /* Search for suitable free block */
    block_meta *last = NULL;
    block_meta *b = find_free_block(&last, size, a);
    
    if (b) {
        /* Found existing free block - reuse it */
        DEBUG_PRINT("Found existing free block for size %zu", size);
        split_block(b, size);  /* Split if significantly larger than needed */
        b->free = 0;           /* Mark as in-use */
        pthread_mutex_unlock(&a->lock);
        DEBUG_PRINT("malloc returning existing block: %p", (void*)(b+1));
        return (void *)(b + 1);
    }

    /* No suitable free block - need to request more memory */
    DEBUG_PRINT("No suitable free block found, requesting new space");
    b = request_space(last, size, a, idx);
    pthread_mutex_unlock(&a->lock);
    
    if (!b) {
        DEBUG_PRINT("malloc failed to get space");
        return NULL;
    }
    
    DEBUG_PRINT("malloc returning new block: %p", (void*)(b+1));
    return (void *)(b + 1);
}

/*
 * free: Release allocated memory
 * 
 * IMPLEMENTATION:
 * 1. Handle NULL pointer (no-op per standard)
 * 2. Get block metadata from user pointer
 * 3. Validate block to catch corruption
 * 4. If mmap'd, remove from global list and munmap
 * 5. If heap block:
 *    a. Lock owning arena
 *    b. Mark as free
 *    c. Coalesce with neighbors
 *    d. Unlock arena
 *
 * SAFETY NOTES:
 * - validate_block() catches some corruption patterns, though a double free
 *   of an otherwise intact block is not reliably detected
 * - Arena index validated to prevent out-of-bounds access
 * - Coalescing happens under lock to prevent races
 */
void free(void *ptr) {
    /* free(NULL) is explicitly allowed by standard */
    if (!ptr) return;
    
    DEBUG_PRINT("free(%p)", ptr);
    
    /* Get block metadata from user pointer */
    block_meta *b = (block_meta *)ptr - 1;
    
    /* Validate to catch corruption or double-free */
    if (!validate_block(b, "free")) {
        DEBUG_PRINT("Attempting to free invalid block");
        return;
    }

    /*
     * MMAP PATH: Return memory directly to OS
     * 
     * mmap'd blocks require special handling:
     * - Not part of any arena list
     * - Must be removed from global mmap tracking list
     * - Memory returned to OS via munmap
     */
    if (b->use_mmap) {
        DEBUG_PRINT("Freeing mmap block");
        
        /* Remove from global tracking list */
        pthread_mutex_lock(&mmap_blocks_lock);
        if (b->prev) {
            b->prev->next = b->next;
        } else {
            mmap_blocks_head = b->next;
        }
        if (b->next) {
            b->next->prev = b->prev;
        }
        pthread_mutex_unlock(&mmap_blocks_lock);
        
        /* Return memory to OS */
        size_t total = b->size + META_SIZE;
        munmap((void *)b, total);
        return;
    }

    /*
     * HEAP PATH: Mark as free and coalesce
     * 
     * Heap blocks remain in the arena list but are marked free.
     * Coalescing merges adjacent free blocks to combat fragmentation.
     */
    
    /* Validate arena index to prevent out-of-bounds access */
    if (b->arena_idx >= NUM_ARENAS) {
        DEBUG_PRINT("Invalid arena index in free: %d", b->arena_idx);
        return;
    }

    /* Lock the owning arena */
    arena_t *a = &arenas[b->arena_idx];
    int lock_result = pthread_mutex_lock(&a->lock);
    if (lock_result != 0) {
        DEBUG_PRINT("Failed to lock arena %d in free: %d", b->arena_idx, lock_result);
        return;
    }
    
    /* Mark as free and attempt to merge with neighbors */
    b->free = 1;
    coalesce(b);
    
    pthread_mutex_unlock(&a->lock);
    DEBUG_PRINT("free completed");
}

/*
 * realloc: Resize an allocation
 * 
 * SEMANTICS (per C standard):
 * - realloc(NULL, size) equivalent to malloc(size)
 * - realloc(ptr, 0) equivalent to free(ptr)
 * - Returns ptr if new size fits in existing block
 * - Otherwise allocates new block, copies data, frees old block
 *
 * OPTIMIZATIONS:
 * - If shrinking, split block and return same pointer
 * - If growing slightly and next block is free, merge and avoid copy
 * - Only copy minimum of old and new sizes
 */
void *realloc(void *ptr, size_t size) {
    /* Handle special cases */
    if (!ptr) return malloc(size);
    if (size == 0) {
        free(ptr);
        return NULL;
    }
    
    /* Align size like malloc does */
    size = ALIGN8(size);

    /* Get and validate block metadata */
    block_meta *b = (block_meta *)ptr - 1;
    if (!validate_block(b, "realloc")) {
        DEBUG_PRINT("realloc called with invalid block");
        return NULL;
    }
    
    /*
     * OPTIMIZATION: If shrinking, return the same pointer
     *
     * Heap blocks are split under the arena lock so the trimmed remainder
     * becomes reusable. mmap blocks are left unsplit: their next/prev fields
     * link the global mmap list rather than heap neighbours, so split_block()
     * must never touch them.
     */
    if (b->size >= size) {
        if (!b->use_mmap) {
            arena_t *a = &arenas[b->arena_idx];
            pthread_mutex_lock(&a->lock);
            split_block(b, size);
            pthread_mutex_unlock(&a->lock);
        }
        return ptr;
    }

    /*
     * OPTIMIZATION: Try to extend into the next block if it's free
     *
     * This avoids the copy overhead when we can grow in place. It only applies
     * to heap blocks, and the conditions are checked under the arena lock
     * because a neighbouring block can change state at any moment. We absorb
     * the next block directly instead of calling coalesce(), which could also
     * merge this in-use block into a free predecessor and invalidate the
     * caller's pointer.
     */
    if (!b->use_mmap) {
        arena_t *a = &arenas[b->arena_idx];
        pthread_mutex_lock(&a->lock);

        if (b->next && b->next->free && !b->next->use_mmap &&
            (b->size + META_SIZE + b->next->size) >= size) {

            /* Absorb the next block's payload and metadata */
            b->size += META_SIZE + b->next->size;
            b->next  = b->next->next;
            if (b->next) b->next->prev = b;

            /* Split if we merged more than needed */
            split_block(b, size);

            /* Block stays in use */
            b->free = 0;

            pthread_mutex_unlock(&a->lock);
            return (void *)(b + 1);
        }

        pthread_mutex_unlock(&a->lock);
    }

    /*
     * FALLBACK: Allocate new block, copy data, free old block
     * 
     * This is the slowest path but handles all cases correctly.
     * We copy only the minimum of old and new sizes to avoid
     * reading beyond allocated memory.
     */
    void *newp = malloc(size);
    if (!newp) return NULL;
    
    /* Copy old data to new location */
    memcpy(newp, ptr, b->size < size ? b->size : size);
    
    /* Free old block */
    free(ptr);
    
    return newp;
}

/*
 * calloc: Allocate zero-initialized memory
 * 
 * INTERFACE: calloc(nmemb, size) allocates nmemb * size bytes, zeroed
 * 
 * SAFETY CONSIDERATIONS:
 * - Must check for integer overflow (nmemb * size could wrap)
 * - Entire allocation must be zeroed before returning
 * - Same alignment guarantees as malloc
 *
 * WHY SEPARATE FROM MALLOC?
 * - Some systems can optimize zeroing (e.g., copy-on-write zero pages)
 * - Explicit zeroing intent helps with security (no data leaks)
 * - C standard requires this interface
 */
void *calloc(size_t nmemb, size_t size) {
    size_t total;
    
    /*
     * OVERFLOW PROTECTION
     * 
     * If nmemb * size would overflow size_t, we must fail safely.
     * __builtin_mul_overflow is a GCC/Clang intrinsic that performs
     * multiplication and returns true if overflow occurred.
     *
     * Example: calloc(SIZE_MAX, 2) would overflow and must return NULL.
     */
    if (__builtin_mul_overflow(nmemb, size, &total)) {
        errno = ENOMEM;
        return NULL;
    }
    
    /* Allocate memory normally */
    void *p = malloc(total);
    if (p) {
        /* Zero the entire allocation */
        memset(p, 0, total);
    }
    
    return p;
}
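
/*
 * Portability note (a sketch, not part of the build): on compilers without
 * __builtin_mul_overflow, an equivalent guard is
 *
 *     if (size != 0 && nmemb > SIZE_MAX / size) { errno = ENOMEM; return NULL; }
 *     total = nmemb * size;
 *
 * which rejects any product that would wrap past SIZE_MAX.
 */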

/* ===================== TESTING/DEBUG FUNCTIONS =========================== */

#ifdef MINI_MALLOC_TESTING

/*
 * TESTING PHILOSOPHY:
 * 
 * These functions expose internal state for testing and debugging.
 * They're conditionally compiled to avoid overhead in production builds.
 * 
 * KEY CAPABILITIES:
 * - Runtime debug enable/disable
 * - Arena statistics gathering
 * - Heap consistency validation
 * - Visual heap inspection
 *
 * These tools proved invaluable during development and remain useful
 * for understanding allocator behavior in production scenarios.
 */

/*
 * mini_malloc_enable_debug: Toggle debug output
 * 
 * Allows runtime control of debug messaging without recompilation.
 * Useful for debugging specific issues without drowning in output.
 */
void mini_malloc_enable_debug(int enable) {
    debug_enabled = enable;
}

/*
 * mini_malloc_get_current_arena_idx: Get calling thread's arena
 * 
 * Exposes arena selection for testing distribution and affinity.
 * Helps verify threads are spreading across arenas as expected.
 */
uint8_t mini_malloc_get_current_arena_idx(void) {
    arena_t *a = get_arena();
    return (uint8_t)(a - arenas);
}

/*
 * mini_malloc_get_arena_stats: Gather statistics for one arena
 * 
 * STATISTICS TRACKED:
 * - total_blocks: Number of blocks (free + allocated)
 * - free_blocks: Number of free blocks
 * - total_size: Sum of all block sizes
 * - free_size: Sum of free block sizes
 *
 * USES:
 * - Fragmentation analysis (many small free blocks = fragmented)
 * - Memory efficiency (free_size / total_size ratio)
 * - Load balancing (compare stats across arenas)
 *
 * THREAD SAFETY:
 * Acquires arena lock to ensure consistent snapshot.
 */
int mini_malloc_get_arena_stats(uint8_t arena_idx, 
                                size_t *total_blocks,
                                size_t *free_blocks, 
                                size_t *total_size,
                                size_t *free_size) {
    if (arena_idx >= NUM_ARENAS) return -1;
    
    arena_t *a = &arenas[arena_idx];
    pthread_mutex_lock(&a->lock);
    
    /* Initialize counters */
    *total_blocks = 0;
    *free_blocks = 0;
    *total_size = 0;
    *free_size = 0;
    
    /* Walk the block list gathering statistics */
    block_meta *b = a->base;
    while (b) {
        if (!validate_block(b, "get_arena_stats")) {
            pthread_mutex_unlock(&a->lock);
            return -1;
        }
        
        (*total_blocks)++;
        *total_size += b->size;
        
        if (b->free) {
            (*free_blocks)++;
            *free_size += b->size;
        }
        b = b->next;
    }
    
    pthread_mutex_unlock(&a->lock);
    return 0;
}

/*
 * mini_malloc_get_global_stats: Aggregate statistics across all arenas
 * 
 * Provides system-wide view of allocator state.
 * Also tracks mmap allocations separately since they're not in arenas.
 *
 * NOTE: Takes multiple locks sequentially, so snapshot may be inconsistent
 * if allocations occur during collection.
 */
void mini_malloc_get_global_stats(size_t *total_blocks,
                                  size_t *free_blocks,
                                  size_t *total_size,
                                  size_t *free_size,
                                  size_t *mmap_blocks,
                                  size_t *mmap_size) {
    /* Initialize all counters */
    *total_blocks = 0;
    *free_blocks = 0;
    *total_size = 0;
    *free_size = 0;
    *mmap_blocks = 0;
    *mmap_size = 0;
    
    /* Gather stats from each arena */
    for (int i = 0; i < NUM_ARENAS; i++) {
        size_t arena_total, arena_free, arena_size, arena_free_size;
        if (mini_malloc_get_arena_stats(i, &arena_total, &arena_free, 
                                        &arena_size, &arena_free_size) == 0) {
            *total_blocks += arena_total;
            *free_blocks += arena_free;
            *total_size += arena_size;
            *free_size += arena_free_size;
        }
        
    }

    /*
     * Count mmap blocks from the global mmap list.
     *
     * mmap'd blocks are never linked into arena lists, so the per-arena walk
     * above cannot see them; they are tallied separately under the mmap
     * list lock.
     */
    pthread_mutex_lock(&mmap_blocks_lock);
    for (block_meta *b = mmap_blocks_head; b; b = b->next) {
        (*mmap_blocks)++;
        *mmap_size += b->size;
    }
    pthread_mutex_unlock(&mmap_blocks_lock);
}

/*
 * mini_malloc_validate_heap: Check heap consistency
 * 
 * VALIDATIONS PERFORMED:
 * - Block metadata sanity (size, arena index)
 * - Linked list consistency (forward/backward pointers match)
 * - No loops in lists
 * - All blocks belong to correct arena
 *
 * RETURN: 1 if heap valid, 0 if corruption detected
 *
 * USE CASES:
 * - Debugging memory corruption
 * - Regression testing
 * - Production monitoring (periodic validation)
 */
int mini_malloc_validate_heap(void) {
    for (int i = 0; i < NUM_ARENAS; i++) {
        arena_t *a = &arenas[i];
        pthread_mutex_lock(&a->lock);
        
        block_meta *b = a->base;
        block_meta *prev = NULL;
        
        while (b) {
            /* Basic block validation */
            if (!validate_block(b, "validate_heap")) {
                pthread_mutex_unlock(&a->lock);
                return 0;
            }
            
            /* Check backward pointer consistency */
            if (b->prev != prev) {
                DEBUG_PRINT("Heap corruption: backward pointer mismatch");
                pthread_mutex_unlock(&a->lock);
                return 0;
            }
            
            /* Verify block belongs to this arena */
            if (b->arena_idx != i) {
                DEBUG_PRINT("Heap corruption: arena index mismatch");
                pthread_mutex_unlock(&a->lock);
                return 0;
            }
            
            /* Check forward pointer consistency */
            if (b->next && b->next->prev != b) {
                DEBUG_PRINT("Heap corruption: forward pointer mismatch");
                pthread_mutex_unlock(&a->lock);
                return 0;
            }
            
            prev = b;
            b = b->next;
        }
        
        pthread_mutex_unlock(&a->lock);
    }
    return 1;
}

/*
 * mini_malloc_debug_print_arenas: Visual heap inspection
 * 
 * Prints human-readable representation of heap state.
 * Useful for understanding fragmentation patterns and debugging.
 *
 * OUTPUT FORMAT:
 * - Each arena listed separately
 * - Blocks shown with size, free status, and address
 * - mmap blocks listed separately
 *
 * LIMITATIONS:
 * - Output truncated to prevent overwhelming terminal
 * - Takes multiple locks, so snapshot may be inconsistent
 */
void mini_malloc_debug_print_arenas(void) {
    printf("\n=== Mini Malloc Debug Info ===\n");
    
    /* Print each arena's state */
    for (int i = 0; i < NUM_ARENAS; i++) {
        arena_t *a = &arenas[i];
        pthread_mutex_lock(&a->lock);
        
        printf("Arena %d:\n", i);
        
        if (!a->base) {
            printf("  (empty)\n");
            pthread_mutex_unlock(&a->lock);
            continue;
        }
        
        block_meta *b = a->base;
        int block_num = 0;
        
        /* Limit output to prevent infinite loops from corruption */
        while (b && block_num < 20) {
            if (!validate_block(b, "debug_print")) {
                printf("  Block %d: INVALID/CORRUPTED\n", block_num);
                break;
            }
            printf("  Block %d: size=%zu, free=%d, mmap=%d, addr=%p\n",
                   block_num++, b->size, b->free, b->use_mmap, (void*)(b+1));
            b = b->next;
        }
        
        if (block_num >= 20) {
            printf("  ... (truncated)\n");
        }
        
        pthread_mutex_unlock(&a->lock);
    }
    
    /* Print mmap blocks separately */
    printf("Mmap Blocks:\n");
    pthread_mutex_lock(&mmap_blocks_lock);
    block_meta *mmap_block = mmap_blocks_head;
    int mmap_count = 0;
    while (mmap_block && mmap_count < 20) {
        printf("  Mmap %d: size=%zu, addr=%p\n", 
               mmap_count++, mmap_block->size, (void*)(mmap_block+1));
        mmap_block = mmap_block->next;
    }
    if (!mmap_blocks_head) {
        printf("  (no mmap blocks)\n");
    }
    pthread_mutex_unlock(&mmap_blocks_lock);
    
    printf("==============================\n\n");
}

/*
 * mini_malloc_force_coalesce_all: Force coalescing in all arenas
 * 
 * PURPOSE:
 * Sometimes for testing we want to force maximum coalescing to measure
 * fragmentation or test coalescing logic.
 *
 * IMPLEMENTATION:
 * Walks all arenas and attempts to coalesce every free block.
 * This is redundant (blocks are coalesced on free) but useful for testing.
 */
void mini_malloc_force_coalesce_all(void) {
    for (int i = 0; i < NUM_ARENAS; i++) {
        arena_t *a = &arenas[i];
        pthread_mutex_lock(&a->lock);
        
        block_meta *b = a->base;
        while (b) {
            if (validate_block(b, "force_coalesce") && b->free) {
                coalesce(b);
            }
            b = b->next;
        }
        
        pthread_mutex_unlock(&a->lock);
    }
}

#endif /* MINI_MALLOC_TESTING */

/*
 * ======================== END OF IMPLEMENTATION ==========================
 * 
 * SUMMARY:
 * 
 * This allocator demonstrates fundamental memory management concepts while
 * remaining simple enough to understand and modify. Key achievements:
 *
 * - Thread-safe operation via arena-based locking
 * - Dual allocation strategy (sbrk/mmap) for different size classes  
 * - Immediate coalescing to combat fragmentation
 * - Comprehensive debugging and validation support
 * - Clean, documented implementation under 800 lines
 *
 * WHAT WE LEARNED:
 *
 * 1. Simplicity has value - our straightforward design performs adequately
 *    for many use cases while remaining understandable.
 *
 * 2. Trade-offs are everywhere - every decision (arena count, coalescing
 *    strategy, metadata placement) involves compromises.
 *
 * 3. Testing is crucial - our debug infrastructure caught numerous bugs
 *    during development and remains valuable for understanding behavior.
 *
 * 4. Real-world considerations matter - alignment, thread safety, and
 *    platform differences significantly impact the implementation.
 *
 * FUTURE DIRECTIONS:
 *
 * This allocator provides a foundation for experimentation:
 * - Add size classes for better small object performance
 * - Implement thread-local caching to reduce lock contention
 * - Experiment with different coalescing strategies
 * - Add memory usage profiling and statistics
 * - Explore lock-free algorithms for hot paths
 *
 * The journey from basic memory management to production allocators is long,
 * but mini_malloc provides a solid first step on that path.
 */


test_mini_malloc.c

/*
 * test_mini_malloc.c - Comprehensive test suite for mini_malloc
 *
 * Compile with: gcc -DMINI_MALLOC_TESTING -o test_mini_malloc test_mini_malloc.c mini_malloc.c -lpthread
 * Run with: ./test_mini_malloc
 *
 * ---------------------------------------------------------------------------
 * TEST DESIGN THINKING
 * ---------------------------------------------------------------------------
 * This suite is structured to systematically validate every aspect of the
 * allocator, from basic correctness to production readiness. The design
 * philosophy follows a layered approach:
 *
 * 1) FUNCTIONAL CORRECTNESS
 *    • Ensure malloc, free, realloc, calloc behave exactly per the C spec.
 *    • Basic allocation/deallocation, zero-size behavior, NULL handling.
 *
 * 2) BOUNDARY & EDGE CASES
 *    • Overflow protection in calloc (guard against SIZE_MAX wraps).
 *    • realloc(NULL, size) and realloc(ptr, 0) semantics.
 *    • free(NULL) no-op guarantee.
 *
 * 3) FRAGMENTATION CONTROL & MEMORY LAYOUT
 *    • Splitting of free blocks: small alloc reuses larger free space.
 *    • Coalescing adjacent frees: free patterns recombine to reduce fragmentation.
 *    • Allocation alignment: every returned pointer is 8-byte aligned so all basic C types can be stored safely.
 *
 * 4) MMAP vs. BRK PATHS
 *    • Triggering mmap for large allocations above threshold.
 *    • Verifying use_mmap metadata flag and region accessibility.
 *
 * 5) CONCURRENCY & THREAD SAFETY
 *    • Multi-threaded random allocation/free fuzz to expose races.
 *    • Arena distribution: per-thread arena selection to reduce lock contention.
 *
 * 6) STRESS & FUZZ TESTING
 *    • High-iteration random allocation patterns simulate real-world chaos.
 *    • Combined allocate/free calls stress fragmentation and merging logic.
 *
 * 7) WHITE-BOX INTROSPECTION
 *    • Internal metadata checks: block size, free flag, arena index.
 *    • Global statistics before/after allocations to validate counters.
 *    • Heap validation API: consistency checks on internal linked lists.
 *
 * Each test reports a single PASS/FAIL with clear diagnostics and aggregates
 * results in a summary. You can grep failures with:
 *
 *   ./test_mini_malloc 2>&1 | grep -E "(FAIL|ERROR)"
 *
 * Debug hooks can be enabled via mini_malloc_enable_debug() for deeper tracing.
 */

#include "mini_malloc.h"

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/wait.h>
#include <time.h>
#include <stdint.h>

// Test framework macros
#define TEST_ASSERT(condition, message) \
    do { \
        if (!(condition)) { \
            printf("❌ FAIL: %s - %s\n", __func__, message); \
            return 0; \
        } \
    } while(0)

#define TEST_PASS(message) \
    do { \
        printf("✅ PASS: %s - %s\n", __func__, message); \
        return 1; \
    } while(0)

#define RUN_TEST(test_func) \
    do { \
        printf("\n--- Running " #test_func " ---\n"); \
        if (test_func()) { \
            tests_passed++; \
        } else { \
            tests_failed++; \
        } \
        total_tests++; \
    } while(0)

// Global test counters
static int tests_passed = 0;
static int tests_failed = 0;
static int total_tests = 0;

// Test functions
int test_basic_malloc_free() {
    void *ptr = malloc(100);
    TEST_ASSERT(ptr != NULL, "malloc(100) should not return NULL");
    
    // Write to the allocated memory
    memset(ptr, 0xAA, 100);
    
    // Verify we can read it back
    char *cptr = (char*)ptr;
    for (int i = 0; i < 100; i++) {
        TEST_ASSERT(cptr[i] == (char)0xAA, "Memory should be writable and readable");
    }
    
    free(ptr);
    TEST_PASS("Basic malloc/free works");
}

int test_zero_size_malloc() {
    void *ptr = malloc(0);
    TEST_ASSERT(ptr == NULL, "malloc(0) should return NULL");
    TEST_PASS("malloc(0) returns NULL correctly");
}


/*
 * test_large_allocation
 * ---------------------
 * 1) size_t n = 128*1024;
 *    // Choose size > MMAP threshold
 * 2) ptr = malloc(n);
 *    // Should invoke mmap path
 * 3) TEST_ASSERT(ptr...);
 *    // Ensure non-NULL
 * 4) memset(ptr, 0x55, n);
 *    // Write pattern across entire region
 * 5) TEST_ASSERT(...[0]==0x55);
 *    // Check first byte
 * 6) TEST_ASSERT(...[n-1]==0x55);
 *    // Check last byte
 * 7) free(ptr);
 *    // Release mmap region
 * 8) TEST_PASS(...);
 */
int test_large_allocation() {
    // Test allocation larger than MMAP_THRESHOLD (64KB)
    size_t large_size = 128 * 1024; // 128KB
    void *ptr = malloc(large_size);
    TEST_ASSERT(ptr != NULL, "Large allocation should succeed");
    
    // Test writing to it
    memset(ptr, 0x55, large_size);
    
    // Verify a few bytes
    char *cptr = (char*)ptr;
    TEST_ASSERT(cptr[0] == 0x55, "First byte should be writable");
    TEST_ASSERT(cptr[large_size-1] == 0x55, "Last byte should be writable");
    
    free(ptr);
    TEST_PASS("Large allocation via mmap works");
}


/*
 * test_calloc
 * -----------
 * for each i in 0..total-1:
 *   TEST_ASSERT(((unsigned char*)ptr)[i]==0);
 */

int test_calloc() {
    size_t nmemb = 10;
    size_t size = 20;
    void *ptr = calloc(nmemb, size);
    TEST_ASSERT(ptr != NULL, "calloc should not return NULL");
    
    // Verify memory is zeroed
    char *cptr = (char*)ptr;
    for (size_t i = 0; i < nmemb * size; i++) {
        TEST_ASSERT(cptr[i] == 0, "calloc should zero-initialize memory");
    }
    
    free(ptr);
    TEST_PASS("calloc works and zeros memory");
}



/*
 * test_calloc_overflow
 * ---------------------
 * Verify overflow detection by passing multipliers that wrap.
 */
int test_calloc_overflow() {
    // Test for overflow protection - use values that will overflow when multiplied
    // but are not statically detectable by the compiler
    size_t large_nmemb = (SIZE_MAX / 2) + 1;
    void *ptr = calloc(large_nmemb, 2);
    TEST_ASSERT(ptr == NULL, "calloc should handle overflow");
    
    // Another overflow test - use variables to avoid static analysis
    volatile size_t nmemb = 0x100000000ULL;  // 2^32
    volatile size_t size = 0x100000000ULL;   // 2^32, product will overflow on 64-bit
    ptr = calloc(nmemb, size);
    TEST_ASSERT(ptr == NULL, "calloc should handle overflow with large values");
    
    TEST_PASS("calloc handles overflow correctly");
}

/*
 * test_realloc_basic
 * -------------------
 * WHAT: realloc growth and shrink, preserves data.
 * WHY: realloc must maintain old contents correctly.
 */
int test_realloc_basic() {
    void *ptr = malloc(50);
    TEST_ASSERT(ptr != NULL, "Initial malloc should succeed");
    
    // Fill with pattern
    memset(ptr, 0x77, 50);
    
    // Expand
    ptr = realloc(ptr, 100);
    TEST_ASSERT(ptr != NULL, "realloc expand should succeed");
    
    // Check original data is preserved
    char *cptr = (char*)ptr;
    for (int i = 0; i < 50; i++) {
        TEST_ASSERT(cptr[i] == 0x77, "Original data should be preserved in realloc");
    }
    
    // Shrink
    ptr = realloc(ptr, 25);
    TEST_ASSERT(ptr != NULL, "realloc shrink should succeed");
    
    // Check data still there
    for (int i = 0; i < 25; i++) {
        TEST_ASSERT(cptr[i] == 0x77, "Data should be preserved when shrinking");
    }
    
    free(ptr);
    TEST_PASS("realloc basic functionality works");
}

/*
 * test_realloc_null_ptr
 * ----------------------
 * WHAT: realloc(NULL, size)
 * WHY: Must behave like malloc(size).
 */
int test_realloc_null_ptr() {
    // realloc(NULL, size) should behave like malloc(size)
    void *ptr = realloc(NULL, 100);
    TEST_ASSERT(ptr != NULL, "realloc(NULL, size) should work like malloc");
    
    memset(ptr, 0x88, 100);
    free(ptr);
    TEST_PASS("realloc(NULL, size) works like malloc");
}

/*
 * test_realloc_zero_size
 * -----------------------
 * WHAT: realloc(ptr, 0)
 * WHY: Must free(ptr) and return NULL.
 */
int test_realloc_zero_size() {
    void *ptr = malloc(100);
    TEST_ASSERT(ptr != NULL, "Initial malloc should succeed");
    
    // realloc(ptr, 0) should behave like free(ptr)
    ptr = realloc(ptr, 0);
    TEST_ASSERT(ptr == NULL, "realloc(ptr, 0) should return NULL");
    
    TEST_PASS("realloc(ptr, 0) works like free");
}


/*
 * test_multiple_allocations
 * --------------------------
 * WHAT: Allocate 10 blocks of increasing size, fill each uniquely,
 *      verify integrity, then free all.
 * WHY: Detects cross-block corruption and tests isolation.
 */
int test_multiple_allocations() {
    const int num_allocs = 10;
    void *ptrs[num_allocs];
    
    // Allocate multiple blocks
    for (int i = 0; i < num_allocs; i++) {
        ptrs[i] = malloc(100 + i * 10);
        TEST_ASSERT(ptrs[i] != NULL, "Multiple allocations should succeed");
        
        // Fill with unique pattern
        memset(ptrs[i], i, 100 + i * 10);
    }
    
    // Verify data integrity
    for (int i = 0; i < num_allocs; i++) {
        char *cptr = (char*)ptrs[i];
        for (int j = 0; j < 100 + i * 10; j++) {
            TEST_ASSERT(cptr[j] == i, "Data integrity should be maintained");
        }
    }
    
    // Free all
    for (int i = 0; i < num_allocs; i++) {
        free(ptrs[i]);
    }
    
    TEST_PASS("Multiple allocations work correctly");
}

/*
 * test_fragmentation_and_coalescing
 * ----------------------------------
 * WHAT: Allocate A,B,C; free B; allocate smaller D in B;
 *      free A,C,D; allocate E larger than any single hole.
 * WHY: Tests splitting of free block for D and coalescing for E.
 */
int test_fragmentation_and_coalescing() {
    // Allocate several blocks
    void *ptr1 = malloc(100);
    void *ptr2 = malloc(100);
    void *ptr3 = malloc(100);
    
    TEST_ASSERT(ptr1 && ptr2 && ptr3, "Initial allocations should succeed");
    
    // Free middle block to create fragmentation
    free(ptr2);
    
    // Allocate a small block that should fit in the freed space
    void *ptr4 = malloc(50);
    TEST_ASSERT(ptr4 != NULL, "Small allocation should succeed");
    
    // Free adjacent blocks to test coalescing
    free(ptr1);
    free(ptr3);
    free(ptr4);
    
    // Allocate a larger block that should use coalesced space
    void *ptr5 = malloc(250);
    TEST_ASSERT(ptr5 != NULL, "Large allocation after coalescing should succeed");
    
    free(ptr5);
    TEST_PASS("Fragmentation and coalescing work");
}

/*
 * test_alignment
 * ---------------
 * WHAT: malloc sizes 1..100, check pointer %8==0
 * WHY: Ensures alignment for all data types.
 */
int test_alignment() {
    void *ptr = malloc(1);
    TEST_ASSERT(ptr != NULL, "malloc(1) should succeed");
    
    // Check 8-byte alignment using header function
    TEST_ASSERT(mini_malloc_is_aligned(ptr), "Allocated memory should be 8-byte aligned");
    
    // Test multiple sizes
    for (size_t size = 1; size <= 100; size++) {
        void *p = malloc(size);
        TEST_ASSERT(p != NULL, "malloc should succeed");
        TEST_ASSERT(mini_malloc_is_aligned(p), "All allocations should be aligned");
        free(p);
    }
    
    free(ptr);
    TEST_PASS("Memory alignment is correct");
}

/*
 * test_free_null
 * --------------
 * WHAT: free(NULL)
 * WHY: Must be safe no-op.
 */
int test_free_null() {
    // free(NULL) should be safe
    free(NULL);
    TEST_PASS("free(NULL) is safe");
}

/*
 * test_thread_safety
 * ------------------
 * WHAT: Spawn 4 threads, each does 10 iterations of alloc/write/read/free
 * WHY: Stresses concurrent arenas and locks.
 */
// Thread test data
struct thread_test_data {
    int thread_id;
    int iterations;
    int success;
};

void* thread_test_func(void* arg) {
    struct thread_test_data* data = (struct thread_test_data*)arg;
    void* ptrs[10]; // Reduced from 100 to 10
    
    printf("Thread %d starting\n", data->thread_id);
    
    for (int i = 0; i < data->iterations; i++) {
        // Initialize array
        for (int j = 0; j < 10; j++) {
            ptrs[j] = NULL;
        }
        
        // Allocate
        for (int j = 0; j < 10; j++) {
            size_t size = (rand() % 100) + 1; // Smaller allocations
            ptrs[j] = malloc(size);
            if (!ptrs[j]) {
                printf("Thread %d: malloc failed at iteration %d, allocation %d\n", 
                       data->thread_id, i, j);
                data->success = 0;
                return NULL;
            }
            // Write to memory to test it's valid
            memset(ptrs[j], (char)(data->thread_id + j), size);
        }
        
        // Validate memory before freeing
        for (int j = 0; j < 10; j++) {
            if (ptrs[j]) {
                char expected = (char)(data->thread_id + j);
                if (((char*)ptrs[j])[0] != expected) {
                    printf("Thread %d: memory corruption detected at iteration %d\n", 
                           data->thread_id, i);
                    data->success = 0;
                    return NULL;
                }
            }
        }
        
        // Free all
        for (int j = 0; j < 10; j++) {
            if (ptrs[j]) {
                free(ptrs[j]);
                ptrs[j] = NULL;
            }
        }
        
        if (i % 5 == 0) {
            printf("Thread %d: completed iteration %d\n", data->thread_id, i);
        }
    }
    
    printf("Thread %d completed successfully\n", data->thread_id);
    data->success = 1;
    return NULL;
}

int test_thread_safety() {
    const int num_threads = 4;
    const int iterations = 10; // Reduced for debugging
    pthread_t threads[num_threads];
    struct thread_test_data data[num_threads];
    
    // Don't enable debug for normal test runs
    // mini_malloc_enable_debug(1);
    
    printf("Starting thread safety test with %d threads, %d iterations each\n", 
           num_threads, iterations);
    
    srand(time(NULL));
    
    // Create threads
    for (int i = 0; i < num_threads; i++) {
        data[i].thread_id = i;
        data[i].iterations = iterations;
        data[i].success = 0;
        
        printf("Creating thread %d\n", i);
        int ret = pthread_create(&threads[i], NULL, thread_test_func, &data[i]);
        TEST_ASSERT(ret == 0, "Thread creation should succeed");
    }
    
    // Wait for threads
    for (int i = 0; i < num_threads; i++) {
        printf("Waiting for thread %d\n", i);
        pthread_join(threads[i], NULL);
        printf("Thread %d completed with success=%d\n", i, data[i].success);
        TEST_ASSERT(data[i].success == 1, "Thread should complete successfully");
    }
    
    // Disable debug output
    mini_malloc_enable_debug(0);
    
    TEST_PASS("Thread safety test passed");
}



/*
 * test_stress
 * -----------
 * WHAT: 1000 iterations of random alloc/free patterns
 * WHY: Simulates real-world chaotic usage.
 */
int test_stress() {
    const int num_iterations = 1000;
    void* ptrs[100];
    
    srand(time(NULL));
    
    for (int i = 0; i < num_iterations; i++) {
        // Random allocation pattern
        int num_allocs = rand() % 100 + 1;
        
        // Allocate
        for (int j = 0; j < num_allocs; j++) {
            size_t size = rand() % 10000 + 1;
            ptrs[j] = malloc(size);
            if (ptrs[j]) {
                // Write random data
                memset(ptrs[j], rand() % 256, size);
            }
        }
        
        // Free randomly
        for (int j = 0; j < num_allocs; j++) {
            if (ptrs[j] && (rand() % 2)) {
                free(ptrs[j]);
                ptrs[j] = NULL;
            }
        }
        
        // Free remaining
        for (int j = 0; j < num_allocs; j++) {
            if (ptrs[j]) {
                free(ptrs[j]);
            }
        }
    }
    
    TEST_PASS("Stress test completed");
}

int test_block_metadata() {
    void *ptr = malloc(100);
    TEST_ASSERT(ptr != NULL, "malloc should succeed");
    
    block_meta_t *meta = mini_malloc_get_block_meta(ptr);
    TEST_ASSERT(meta != NULL, "Block metadata should be accessible");
    TEST_ASSERT(meta->size >= 100, "Block size should be at least requested size");
    TEST_ASSERT(meta->free == 0, "Allocated block should not be marked free");
    TEST_ASSERT(meta->arena_idx < MINI_MALLOC_NUM_ARENAS, "Arena index should be valid");
    
    free(ptr);
    TEST_PASS("Block metadata is correct");
}


/*
 * test_arena_distribution
 * ------------------------
 * WHAT: malloc in main thread, check arena_idx
 * WHY: Confirms per-thread arena mapping.
 */
int test_arena_distribution() {
    // Test that different threads get different arenas
    uint8_t arena_idx = mini_malloc_get_current_arena_idx();
    TEST_ASSERT(arena_idx < MINI_MALLOC_NUM_ARENAS, "Arena index should be valid");
    
    // Allocate in current arena and verify
    void *ptr = malloc(100);
    TEST_ASSERT(ptr != NULL, "malloc should succeed");
    
    block_meta_t *meta = mini_malloc_get_block_meta(ptr);
    TEST_ASSERT(meta->arena_idx == arena_idx, "Block should be in expected arena");
    
    free(ptr);
    TEST_PASS("Arena distribution works correctly");
}
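
A natural extension, not part of the suite above, is sketched below: spawn several worker threads, record the arena index each one maps to, and assert that every index is valid. It assumes mini_malloc_get_current_arena_idx() may be called from any thread; how evenly the striped mapping spreads threads across arenas is policy-dependent, so only validity is asserted.

/* Hypothetical extension (not in the suite above): each worker records the
 * arena index assigned to it by the striped per-thread mapping. */
static void *record_arena_idx(void *arg) {
    *(uint8_t *)arg = mini_malloc_get_current_arena_idx();
    return NULL;
}

int test_arena_distribution_multithreaded() {
    enum { NUM_WORKERS = 8 };
    pthread_t tids[NUM_WORKERS];
    uint8_t idx[NUM_WORKERS];
    
    for (int i = 0; i < NUM_WORKERS; i++) {
        int ret = pthread_create(&tids[i], NULL, record_arena_idx, &idx[i]);
        TEST_ASSERT(ret == 0, "Thread creation should succeed");
    }
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(tids[i], NULL);
    }
    
    // Print the mapping so the distribution is visible; assert only validity,
    // since the exact striping policy is an implementation detail.
    for (int i = 0; i < NUM_WORKERS; i++) {
        printf("Worker %d mapped to arena %u\n", i, (unsigned)idx[i]);
        TEST_ASSERT(idx[i] < MINI_MALLOC_NUM_ARENAS, "Arena index should be valid");
    }
    
    TEST_PASS("Multi-threaded arena mapping is valid");
}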


/*
 * test_statistics
 * ----------------
 * WHAT: Compare global stats before/after allocs
 * WHY: Verifies internal counters track properly.
 */
int test_statistics() {
    // Test statistics by comparing before/after a specific allocation
    size_t before_total, before_free, before_size, before_free_size;
    size_t before_mmap_blocks, before_mmap_size;
    
    // Get baseline stats
    mini_malloc_get_global_stats(&before_total, &before_free, 
                                 &before_size, &before_free_size,
                                 &before_mmap_blocks, &before_mmap_size);
    
    printf("Before allocation: total_blocks=%zu, total_size=%zu, free_blocks=%zu, free_size=%zu\n", 
           before_total, before_size, before_free, before_free_size);
    
    // Allocate two blocks with unusual sizes; depending on the heap state,
    // these may split an existing free block or create brand-new blocks
    void *ptr1 = malloc(137);  // Unusual size
    void *ptr2 = malloc(239);  // Unusual size
    TEST_ASSERT(ptr1 && ptr2, "Allocations should succeed");
    
    // Get stats after allocation
    size_t after_total, after_free, after_size, after_free_size;
    size_t after_mmap_blocks, after_mmap_size;
    
    mini_malloc_get_global_stats(&after_total, &after_free,
                                 &after_size, &after_free_size,
                                 &after_mmap_blocks, &after_mmap_size);
    
    printf("After allocation: total_blocks=%zu, total_size=%zu, free_blocks=%zu, free_size=%zu\n", 
           after_total, after_size, after_free, after_free_size);
    
    // Key subtlety: when an allocation is carved out of an existing free block,
    // total_size stays the same while free_size decreases; that is correct
    // behavior, not a bug.
    
    printf("Total blocks increased by: %zu\n", after_total - before_total);
    printf("Free size decreased by: %zu\n", before_free_size - after_free_size);
    
    // Check that blocks were tracked properly
    TEST_ASSERT(after_total >= before_total, "Total blocks should increase or stay same");
    
    // When allocating from existing free space, free size should decrease
    if (before_free_size > 0 && after_free_size < before_free_size) {
        printf("Successfully allocated from existing free space\n");
        size_t allocated_from_free = before_free_size - after_free_size;
        TEST_ASSERT(allocated_from_free >= 376, "Should have allocated at least 376 bytes from free space");
    } else if (after_size > before_size) {
        printf("Successfully created new blocks\n");
        TEST_ASSERT(after_size >= before_size + 376, "Total size should increase when creating new blocks");
    } else {
        // Also acceptable: the two requests may have been satisfied by a mix
        // of free-block reuse and newly created blocks
        printf("Allocation completed successfully\n");
    }
    
    // Free the memory
    free(ptr1);
    free(ptr2);
    
    TEST_PASS("Statistics tracking works");
}
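
In the sample run shown in the output further below, the two requests total 137 + 239 = 376 bytes while free_size drops by 448; the 72-byte difference is per-block metadata and alignment padding, which is exactly why the test asserts only a lower bound of 376 rather than an exact figure.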


/*
 * test_heap_validation
 * --------------------
 * WHAT: Calls heap validator after alloc/free steps
 * WHY: Detects inconsistencies or corruption.
 */
int test_heap_validation() {
    // Allocate some memory to create a heap
    void *ptr1 = malloc(100);
    void *ptr2 = malloc(200);
    void *ptr3 = malloc(300);
    
    TEST_ASSERT(ptr1 && ptr2 && ptr3, "Allocations should succeed");
    
    // Validate heap consistency
    int valid = mini_malloc_validate_heap();
    TEST_ASSERT(valid == 1, "Heap should be consistent");
    
    // Free some blocks
    free(ptr2);
    
    // Should still be valid
    valid = mini_malloc_validate_heap();
    TEST_ASSERT(valid == 1, "Heap should remain consistent after free");
    
    free(ptr1);
    free(ptr3);
    
    TEST_PASS("Heap validation works");
}
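
Beyond the test itself, the same validator is useful as an ad-hoc debugging aid. One possible wrapper is sketched below; it is hypothetical (not part of mini_malloc or this suite) and only assumes, as the test above does, that mini_malloc_validate_heap() returns 1 when the heap is consistent.

#include <assert.h>

/* Hypothetical debug helper: validate the heap before and after an
 * allocation-heavy statement. Intended for debug builds only. */
#define WITH_HEAP_CHECK(stmt)                              \
    do {                                                   \
        assert(mini_malloc_validate_heap() == 1);          \
        stmt;                                              \
        assert(mini_malloc_validate_heap() == 1);          \
    } while (0)

/* Example usage: WITH_HEAP_CHECK(buf = malloc(4096)); */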


/*
 * test_large_block_mmap
 * ---------------------
 * WHAT: Allocates a block double the mmap threshold with debug on,
 *      checks the use_mmap flag and writes to the region.
 * WHY: Ensures the allocator routes large requests through mmap and
 *      properly flags and manages those regions.
 */
int test_large_block_mmap() {
    // Use a very large size to ensure no existing free block can satisfy it
    size_t large_size = 2 * MINI_MALLOC_MMAP_THRESHOLD;  // twice the 64 KiB threshold = 128 KiB
    
    printf("Allocating %zu bytes (threshold is %d)\n", large_size, MINI_MALLOC_MMAP_THRESHOLD);
    printf("Will this use mmap? %s\n", (large_size >= MINI_MALLOC_MMAP_THRESHOLD) ? "YES" : "NO");
    
    // Enable debug to see the allocation path
    mini_malloc_enable_debug(1);
    
    void *ptr = malloc(large_size);
    TEST_ASSERT(ptr != NULL, "Large allocation should succeed");
    
    mini_malloc_enable_debug(0);
    
    // Check it's marked as mmap
    block_meta_t *meta = mini_malloc_get_block_meta(ptr);
    TEST_ASSERT(meta != NULL, "Block metadata should exist");
    
    printf("Block metadata: size=%zu, use_mmap=%d, requested=%zu\n", 
           meta->size, meta->use_mmap, large_size);
    
    if (meta->use_mmap != 1) {
        printf("ERROR: Large block not using mmap!\n");
        printf("Requested: %zu, Threshold: %d\n", large_size, MINI_MALLOC_MMAP_THRESHOLD);
        printf("This suggests the allocation reused an existing large free block\n");
        
        // Print debug info to see what happened
        mini_malloc_debug_print_arenas();
    }
    
    TEST_ASSERT(meta->use_mmap == 1, "Large block should use mmap");
    
    // Write to it to ensure it's accessible
    memset(ptr, 0x42, large_size);
    
    free(ptr);
    TEST_PASS("Large block mmap handling works");
}
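
The debug trace in the output further below makes the size accounting concrete: the 131072-byte request becomes a 135168-byte mapping (rounded up to a page multiple after adding the block header), and the reported block size of 135136 bytes is that mapping minus a 32-byte header in this particular build; the exact numbers depend on the page size and header layout.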

void print_test_summary() {
    printf("\n==================================================\n");
    printf("TEST SUMMARY\n");
    printf("==================================================\n");
    printf("Total tests: %d\n", total_tests);
    printf("Passed: %d\n", tests_passed);
    printf("Failed: %d\n", tests_failed);
    printf("Success rate: %.1f%%\n", 
           total_tests > 0 ? (100.0 * tests_passed / total_tests) : 0.0);
    
    if (tests_failed == 0) {
        printf("\n🎉 All tests passed! Your malloc implementation looks good.\n");
    } else {
        printf("\n⚠️  Some tests failed. Check the implementation.\n");
    }
}

int main() {
    printf("Starting mini_malloc test suite...\n");
    
    // Run all tests
    RUN_TEST(test_basic_malloc_free);
    RUN_TEST(test_zero_size_malloc);
    RUN_TEST(test_large_allocation);
    RUN_TEST(test_calloc);
    RUN_TEST(test_calloc_overflow);
    RUN_TEST(test_realloc_basic);
    RUN_TEST(test_realloc_null_ptr);
    RUN_TEST(test_realloc_zero_size);
    RUN_TEST(test_multiple_allocations);
    RUN_TEST(test_fragmentation_and_coalescing);
    RUN_TEST(test_alignment);
    RUN_TEST(test_free_null);
    RUN_TEST(test_thread_safety);
    RUN_TEST(test_stress);
    
    // Advanced tests using testing API
    RUN_TEST(test_block_metadata);
    RUN_TEST(test_arena_distribution);
    RUN_TEST(test_statistics);
    RUN_TEST(test_heap_validation);
    RUN_TEST(test_large_block_mmap);
    
    print_test_summary();
    
    return tests_failed > 0 ? 1 : 0;
}

Sample output:

Starting mini_malloc test suite...

--- Running test_basic_malloc_free ---
✅ PASS: test_basic_malloc_free - Basic malloc/free works

--- Running test_zero_size_malloc ---
✅ PASS: test_zero_size_malloc - malloc(0) returns NULL correctly

--- Running test_large_allocation ---
✅ PASS: test_large_allocation - Large allocation via mmap works

--- Running test_calloc ---
✅ PASS: test_calloc - calloc works and zeros memory

--- Running test_calloc_overflow ---
✅ PASS: test_calloc_overflow - calloc handles overflow correctly

--- Running test_realloc_basic ---
✅ PASS: test_realloc_basic - realloc basic functionality works

--- Running test_realloc_null_ptr ---
✅ PASS: test_realloc_null_ptr - realloc(NULL, size) works like malloc

--- Running test_realloc_zero_size ---
✅ PASS: test_realloc_zero_size - realloc(ptr, 0) works like free

--- Running test_multiple_allocations ---
✅ PASS: test_multiple_allocations - Multiple allocations work correctly

--- Running test_fragmentation_and_coalescing ---
✅ PASS: test_fragmentation_and_coalescing - Fragmentation and coalescing work

--- Running test_alignment ---
✅ PASS: test_alignment - Memory alignment is correct

--- Running test_free_null ---
✅ PASS: test_free_null - free(NULL) is safe

--- Running test_thread_safety ---
Starting thread safety test with 4 threads, 10 iterations each
Creating thread 0
Creating thread 1
Thread 0 starting
Thread 0: completed iteration 0
Creating thread 2
Thread 1 starting
Thread 1: completed iteration 0
Thread 0: completed iteration 5
Thread 1: completed iteration 5
Thread 0 completed successfully
Thread 2 starting
Thread 2: completed iteration 0
Creating thread 3
Thread 1 completed successfully
Thread 2: completed iteration 5
Thread 2 completed successfully
Thread 3 starting
Thread 3: completed iteration 0
Thread 3: completed iteration 5
Waiting for thread 0
Thread 3 completed successfully
Thread 0 completed with success=1
Waiting for thread 1
Thread 1 completed with success=1
Waiting for thread 2
Thread 2 completed with success=1
Waiting for thread 3
Thread 3 completed with success=1
✅ PASS: test_thread_safety - Thread safety test passed

--- Running test_stress ---
✅ PASS: test_stress - Stress test completed

--- Running test_block_metadata ---
✅ PASS: test_block_metadata - Block metadata is correct

--- Running test_arena_distribution ---
✅ PASS: test_arena_distribution - Arena distribution works correctly

--- Running test_statistics ---
Before allocation: total_blocks=6, total_size=551344, free_blocks=1, free_size=549232
After allocation: total_blocks=8, total_size=551280, free_blocks=1, free_size=548784
Total blocks increased by: 2
Free size decreased by: 448
Successfully allocated from existing free space
✅ PASS: test_statistics - Statistics tracking works

--- Running test_heap_validation ---
✅ PASS: test_heap_validation - Heap validation works

--- Running test_large_block_mmap ---
Allocating 131072 bytes (threshold is 65536)
Will this use mmap? YES
[DEBUG] malloc(131072) after alignment
[DEBUG] Thread 140589820630848 using arena 0
[DEBUG] malloc(131072) in arena 0
[DEBUG] Large allocation (131072 >= 65536), skipping free block search
[DEBUG] Requesting space: size=131072, arena=0
[DEBUG] MMAP_THRESHOLD is 65536, will use mmap: YES
[DEBUG] Using mmap for large allocation
[DEBUG] mmap: pagesz=4096, total=135168
[DEBUG] mmap block created: addr=0x7fdd9c4f0020, size=135136, use_mmap=1
[DEBUG] malloc returning new mmap block: 0x7fdd9c4f0020
Block metadata: size=135136, use_mmap=1, requested=131072
✅ PASS: test_large_block_mmap - Large block mmap handling works

==================================================
TEST SUMMARY
==================================================
Total tests: 19
Passed: 19
Failed: 0
Success rate: 100.0%

🎉 All tests passed! Your malloc implementation looks good.

End of Document



