Optimizing Go for High-Throughput Systems

Memory management, concurrency patterns, and profiling techniques that took my Go services from adequate to high-performance. Practical optimization, not premature optimization.

Tags: Golang · Performance · Backend

Go is fast out of the box, but “fast enough” and “production-optimized” are different things. When your service handles tens of thousands of requests per second, small inefficiencies compound — unnecessary heap allocations, contention on shared mutexes, GC pauses at the worst possible moment. I’ve spent a lot of time profiling and optimizing Go services under heavy load, and this post covers the techniques that consistently make the biggest difference.

The cardinal rule: profile first, optimize second. Every technique below came from staring at pprof output, not from guessing.

Always start with profiling

Go’s built-in profiling tools are genuinely excellent. I add pprof to every production service from day one:

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side-effect: registers /debug/pprof/* on http.DefaultServeMux
)

func init() {
	go func() {
		// Dedicated listener on localhost, using the DefaultServeMux that pprof registered on.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}

One important caveat: net/http/pprof’s init() registers its handlers on http.DefaultServeMux as a side effect of the blank import. Don’t share http.DefaultServeMux with your application server — if your public HTTP server also uses nil (i.e. the default mux) as its handler, you’ve just exposed /debug/pprof/* to the internet. Create a dedicated http.NewServeMux() for application handlers and only let the pprof listener touch the default mux:

appMux := http.NewServeMux()
appMux.HandleFunc("/api/orders", handleOrders)
// ... more routes ...
go http.ListenAndServe(":8080", appMux)                 // public traffic, no pprof
go http.ListenAndServe("localhost:6060", nil)           // pprof on the default mux, localhost only

That’s it. Now you can grab CPU and memory profiles from a running service:

# 30-second CPU profile
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"  # quote it: '?' is a glob character in some shells

# Heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# Goroutine dump (great for finding leaks)
go tool pprof http://localhost:6060/debug/pprof/goroutine

For benchmarking specific functions, use Go’s testing package:

func BenchmarkProcessOrder(b *testing.B) {
	order := createTestOrder()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		processOrder(order)
	}
}

Run with go test -bench=. -benchmem — the -benchmem flag is critical because it shows allocations per operation, which is usually where the wins are.

Reduce allocations — this is almost always the biggest win

In every Go service I’ve profiled, the number one performance bottleneck is heap allocations. Every allocation eventually becomes GC work, and GC pauses are what kill your p99 latency.

Pre-allocate slices

This is the lowest-hanging fruit:

// Before: grows the slice dynamically, causes multiple allocations
func filterActive(users []User) []User {
	var active []User
	for _, u := range users {
		if u.Active {
			active = append(active, u)
		}
	}
	return active
}

// After: one allocation, right-sized
func filterActive(users []User) []User {
	active := make([]User, 0, len(users))
	for _, u := range users {
		if u.Active {
			active = append(active, u)
		}
	}
	return active
}

Yes, the pre-allocated version might waste memory if few users are active. I don’t care. The allocation cost dominates in hot paths.

Use sync.Pool for frequently allocated objects

For objects that are created and discarded at high frequency (buffers, temporary structs), sync.Pool eliminates the allocation entirely after warmup:

// Imports: "bytes", "sync"
var bufferPool = sync.Pool{
	New: func() any {
		return new(bytes.Buffer)
	},
}

func processRequest(data []byte) []byte {
	buf := bufferPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset()
		bufferPool.Put(buf)
	}()

	buf.Write(data)
	buf.WriteString(" processed")
	// Copy before returning: the buffer goes back to the pool and will be
	// reused, so handing out buf.Bytes() directly would be a data race.
	return bytes.Clone(buf.Bytes()) // bytes.Clone requires Go 1.20+
}

I use this pattern for JSON encoding buffers, HTTP response writers, and any struct that shows up as a top allocator in heap profiles.

Use strings.Builder for concatenation

Never concatenate strings with + in a loop:

// Terrible: O(n^2) allocations
func join(strs []string) string {
	result := ""
	for _, s := range strs {
		result += s
	}
	return result
}

// Good: single allocation (requires "strings")
func join(strs []string) string {
	total := 0
	for _, s := range strs {
		total += len(s)
	}
	var b strings.Builder
	b.Grow(total)
	for _, s := range strs {
		b.WriteString(s)
	}
	return b.String()
}

The Grow call pre-allocates the internal buffer. Without it, strings.Builder still beats +, but with it you get exactly one allocation.

Avoid interface boxing in hot paths

Every time you assign a concrete type to an interface{}, Go may allocate to box the value. In hot paths, use concrete types:

// Causes allocation: the int is boxed into an interface
func sumAny(values []any) int {
	total := 0
	for _, v := range values {
		if n, ok := v.(int); ok {
			total += n
		}
	}
	return total
}

// No allocation: concrete types all the way
func sumInts(values []int) int {
	total := 0
	for _, v := range values {
		total += v
	}
	return total
}

This matters most in serialization code, middleware chains, and anything that processes every request.

Concurrency patterns that actually scale

Worker pools with bounded concurrency

Unlimited goroutines are a denial-of-service vulnerability. I use bounded worker pools for any work that involves external I/O:

type WorkerPool struct {
	tasks  chan func()
	wg     sync.WaitGroup
	closed chan struct{}
	once   sync.Once
}

func NewWorkerPool(workers, queueSize int) *WorkerPool {
	p := &WorkerPool{
		tasks:  make(chan func(), queueSize),
		closed: make(chan struct{}),
	}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for {
				select {
				case task := <-p.tasks:
					safeRun(task)
				case <-p.closed:
					// Drain whatever's still queued, then exit.
					for {
						select {
						case task := <-p.tasks:
							safeRun(task)
						default:
							return
						}
					}
				}
			}
		}()
	}
	return p
}

// safeRun isolates a single task's panic from the worker goroutine.
// Without this, one panicking task (nil deref, divide-by-zero, panic in a
// user-supplied fn) kills the worker permanently. Workers die silently, the
// queue backs up, and Shutdown()'s wg.Wait() hangs forever because wg.Done()
// never ran. Recover, log, move on.
func safeRun(task func()) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("workerpool: task panicked: %v", r)
			// In production, also emit a metric and capture the stack via debug.Stack().
		}
	}()
	task()
}

func (p *WorkerPool) Submit(task func()) bool {
	// Reject submissions once shutdown has started.
	select {
	case <-p.closed:
		return false
	default:
	}
	select {
	case p.tasks <- task:
		return true
	default:
		return false // queue full, apply backpressure
	}
}

func (p *WorkerPool) Shutdown() {
	p.once.Do(func() { close(p.closed) })
	p.wg.Wait()
}

Imports for the snippet above: log, sync. The pool uses the standard library only — no third-party dependencies.

safeRun exists because a panicking task would otherwise kill the worker goroutine permanently. A dead worker means wg.Done() never fires, so Shutdown()’s wg.Wait() blocks forever, and the queue silently backs up. Recovering per-task keeps the pool self-healing: one bad task logs, the worker picks up the next. Don’t skip this even if you “control all the callers” — nil map writes, divide-by-zero, and out-of-bounds slice access all panic, and they all eventually happen.

The non-blocking Submit is intentional. When the queue is full, I want to return a 503 to the caller rather than let goroutines pile up until the process OOMs.

The closed channel and sync.Once exist to avoid a crash bug that bites every naive worker-pool implementation: if Shutdown() closes tasks while a Submit() is mid-send, the send panics with send on closed channel and takes the process down. The fix is to never close tasks at all. Instead, close a separate closed signal; Submit checks it first and bails out, workers drain remaining tasks and exit when they see it. sync.Once makes Shutdown() idempotent — important because it’s usually called from both a signal handler and a defer in production services, and double-closing a channel panics too.

One race remains: a Submit can pass the closed guard and then send to tasks after Shutdown() returns. That’s benign here because tasks is never closed, so the send buffers (or hits default and gets rejected) and the queued task is simply dropped on process exit. If your tasks must not be dropped, promote the pool to a context.Context-aware variant and let callers serialize shutdown against inflight submits.

But Submit returning false is a real decision the caller has to make — don’t just log and forget. Your options, roughly in order of how often I pick each one:

  • Shed load (return 503 / 429). My default for synchronous request paths. The caller retries with backoff, the client gets honest signal that you’re overloaded, and your tail latency stays bounded. This is the whole point of backpressure.
  • Queue externally (Kafka, SQS, Redis). If the work is asynchronous anyway — emails, webhooks, analytics events — push rejected tasks to a durable queue with its own consumers. The in-memory pool becomes a fast path; the external queue is the overflow.
  • Retry with backoff, in-process. Only when the caller has nothing better to do and the work is idempotent. Cap the retries tightly (2-3) or you’ll just move the pileup into the calling goroutines.
  • Drop silently. Rare, but legitimate for high-volume fire-and-forget telemetry where stale data is worthless. If you do this, at minimum increment a counter — a silent drop you can’t see on a dashboard is a bug waiting to ship.

The wrong answer is “block the caller until there’s space” — that turns bounded concurrency back into unbounded, and you’ve defeated the pool.

Prevent goroutine leaks with context

Every goroutine you spawn should have a cancellation path. I’ve seen production services with 500,000+ leaked goroutines because someone forgot this:

// Imports: "context", "sync", "time" (fetchURL is assumed to be defined elsewhere)
func fetchAll(ctx context.Context, urls []string) []string {
	results := make(chan string, len(urls))
	var wg sync.WaitGroup

	for _, url := range urls {
		wg.Add(1)
		go func(url string) {
			defer wg.Done()

			fetchCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
			defer cancel()

			data, err := fetchURL(fetchCtx, url)
			if err != nil {
				return
			}

			select {
			case results <- data:
			case <-ctx.Done():
				return
			}
		}(url)
	}

	// Closer goroutine: always runs, always closes, even if ctx is cancelled.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

Two things worth flagging here. The select on results with ctx.Done() is what keeps workers from blocking forever trying to send when nobody’s reading — without it, a cancelled parent context would leave every worker parked on a full buffer. The closer goroutine (wg.Wait(); close(results)) runs unconditionally so the main for r := range results loop terminates even on cancellation; if you guard that close behind a ctx check, you reintroduce the deadlock you were trying to avoid. This bug class — “producer exits on cancel, consumer keeps ranging, closer never fires” — is one of the most common goroutine leaks I see in review. Buffer-size matters too: len(urls) means a worker can always send without blocking; a smaller buffer turns every unread result into a cancellation-sensitive send.

Use buffered channels correctly

Unbuffered channels cause goroutines to block on every send. For producer-consumer patterns, size the buffer to absorb bursts:

// Bad: producer blocks on every item
ch := make(chan Event)

// Better: buffer absorbs burst, producer rarely blocks
ch := make(chan Event, 1000)

But don’t just pick a big number — profile to find where channels are contended and right-size based on actual throughput.

I/O optimization

Connection pooling

The default http.Client in Go reuses connections, but its defaults are conservative. For services that make many outbound calls:

func OptimizedHTTPClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			MaxIdleConns:        100,
			MaxIdleConnsPerHost: 100,
			MaxConnsPerHost:     100,
			IdleConnTimeout:     90 * time.Second,
			ForceAttemptHTTP2:   true,
		},
		Timeout: 30 * time.Second,
	}
}

MaxIdleConnsPerHost defaults to 2. If you’re making 50 concurrent requests to the same host, 48 of them are creating new connections. Bump this to match your actual concurrency.

Batch database writes

Individual inserts are expensive. Batch them:

// Imports: "context", "database/sql", "errors", "fmt" (min is a builtin as of Go 1.21)
func batchInsert(ctx context.Context, db *sql.DB, items []Item) error {
	const batchSize = 1000
	for i := 0; i < len(items); i += batchSize {
		end := min(i+batchSize, len(items))
		batch := items[i:end]

		tx, err := db.BeginTx(ctx, nil)
		if err != nil {
			return fmt.Errorf("begin tx: %w", err)
		}

		stmt, err := tx.PrepareContext(ctx, "INSERT INTO items (name, data) VALUES (?, ?)")
		if err != nil {
			tx.Rollback()
			return fmt.Errorf("prepare: %w", err)
		}

		for _, item := range batch {
			if _, err := stmt.ExecContext(ctx, item.Name, item.Data); err != nil {
				closeErr := stmt.Close()
				tx.Rollback()
				return fmt.Errorf("insert item %d: %w", item.ID, errors.Join(err, closeErr))
			}
		}
		if err := stmt.Close(); err != nil {
			tx.Rollback()
			return fmt.Errorf("stmt close: %w", err)
		}

		if err := tx.Commit(); err != nil {
			return fmt.Errorf("commit: %w", err)
		}
	}
	return nil
}

A batch of 1000 inserts in a single transaction is roughly 50x faster than 1000 individual inserts. The transaction reduces fsync calls from 1000 to 1.

Two details that trip people up. First, stmt.Close() can return an error — pooled statements, driver-side bookkeeping — and swallowing it means you’ll chase phantom connection-pool bugs later. errors.Join is the cleanest way to surface both the insert error and the close error without nesting. Second, partial-batch failure: as written, a single failing row rolls back the whole batch of 1000. That’s usually what you want (atomic batches, idempotent retries at the batch level). If you need per-row durability — say, ingesting dirty data where some rows are expected to fail — wrap each ExecContext in a savepoint, or split failed rows into a quarantine table on the second pass. Don’t try to be clever with continue-on-error inside a single transaction: most drivers abort the whole tx on the first error anyway.

GC tuning — use sparingly

Go’s GC is good. Most of the time, reducing allocations (above) is better than tuning GC parameters. But for specific workloads, two knobs matter:

import "runtime/debug"

func tuneForThroughput() {
	// Default GOGC is 100 (GC triggers when heap doubles).
	// 300 means GC triggers when heap reaches 4x live data.
	// Trades memory for fewer GC pauses.
	debug.SetGCPercent(300)

	// Hard memory limit -- essential in containers.
	// Prevents OOM kills when GOGC is high.
	debug.SetMemoryLimit(1 << 30) // 1 GB
}

SetGCPercent(300) with SetMemoryLimit is my go-to for latency-sensitive services in containers. The memory limit acts as a safety net — if heap pressure gets too high, Go triggers GC early regardless of the GOGC setting.

The old “memory ballast” trick (allocating a large []byte to inflate live heap size) is obsolete since Go 1.19 added SetMemoryLimit. Don’t use it.
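Both knobs also exist as environment variables (GOMEMLIMIT since Go 1.19), which lets ops tune them per deployment without a rebuild — `./myservice` below is a placeholder for your binary:

```shell
# Equivalent to SetGCPercent(300) + SetMemoryLimit(1 << 30), no code change:
GOGC=300 GOMEMLIMIT=1GiB ./myservice
```

GOMEMLIMIT accepts suffixes like B, KiB, MiB, GiB, and TiB. A value set in code via debug.SetMemoryLimit overrides the environment variable, so pick one mechanism and stick with it.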

What to optimize first

When I’m brought in to optimize a Go service, this is my checklist in order:

  1. Add pprof, take a CPU and heap profile. Don’t guess.
  2. Fix allocation hot spots. sync.Pool, pre-allocated slices, strings.Builder. This alone usually cuts p99 latency by 30-50%.
  3. Fix goroutine leaks. Check the goroutine profile. If it’s growing over time, you have a leak.
  4. Right-size connection pools for HTTP clients and database connections.
  5. Batch I/O operations where possible.
  6. Tune GC only if allocation reduction wasn’t enough.
  7. Algorithm changes — if profiling shows a hot function with bad complexity, fix the algorithm. But this is rarer than people think.

The techniques that look clever (SoA layouts, SIMD, custom allocators) are almost never where the actual wins are. Profile first. The boring optimizations — pre-allocating a slice, pooling a buffer, batching database writes — are what move the needle in real systems.
