Dung (Donny) Nguyen

Senior Software Engineer

Detailed PM2 Monitoring Guide for Bottleneck Detection

1. Basic PM2 Monitoring Commands

Real-Time Process Monitoring

pm2 monit

This opens an interactive terminal dashboard with four panes:

  - Process list with live CPU and memory per process
  - Streaming logs for the selected process
  - Custom metrics (populated if the app uses @pm2/io)
  - Process metadata (uptime, restarts, script path)

Process Status Overview

pm2 status
# or
pm2 ls

Shows a table with:

  - id, name, and mode (fork or cluster)
  - status (online, stopped, errored)
  - ↺ (restart count) and uptime
  - cpu (%) and mem columns

Detailed Process Information

pm2 show backend-api
pm2 show backend-queue

Provides:

  - Script path, exec cwd, and log file locations
  - Restart count, unstable restarts, and uptime
  - Exec mode, Node.js version, and environment
  - Code metrics (heap size, event loop latency) when @pm2/io is enabled

2. Critical Metrics to Monitor

Assuming your ecosystem.config.cjs sets max_memory_restart: '300M', these are the key metrics:

A. Memory Usage ⚠️ CRITICAL FOR YOUR SETUP

pm2 show backend-api | grep memory
pm2 show backend-queue | grep memory

What to look for:

  - A steady climb toward 300MB: likely a memory leak
  - A sawtooth pattern (climb, then drop at restart): max_memory_restart is papering over a leak
  - Stable usage well under the limit: healthy

Action thresholds:

  - Sustained usage above ~240MB (80% of the limit): start investigating
  - Frequent restarts triggered by max_memory_restart: fix the leak or raise the limit deliberately

Check memory trend:

# Watch memory over time (refresh every 2 seconds)
watch -n 2 'pm2 status'
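
The trend matters more than any single reading: a consistently positive growth per sample is the leak signature. A minimal sketch with fabricated readings (not real pm2 output):

```javascript
// Fabricated memory samples in MB, one per polling interval
const samples = [150, 165, 181, 198]

// Average growth between consecutive samples
const deltas = samples.slice(1).map((value, i) => value - samples[i])
const avgGrowth = deltas.reduce((a, b) => a + b, 0) / deltas.length

console.log(`avg growth: ${avgGrowth.toFixed(1)} MB/sample`)
// A steadily positive value across many samples suggests a leak;
// a flat or oscillating value is normal GC behavior
```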

B. CPU Usage

pm2 status
# Look at 'cpu' column

What to look for:

  - Short spikes during traffic bursts: normal
  - Sustained high CPU under flat traffic: CPU-bound work or a busy loop
  - One instance pegged while the others idle: uneven load distribution

Action thresholds:

  - Sustained >70-80% per instance: profile the hot paths
  - 100% across all instances: you are CPU-bound; optimize the code or add capacity

Your cluster mode setup:

// From ecosystem.config.cjs
instances: 'max',  // Uses all CPU cores
exec_mode: 'cluster'

Check CPU distribution:

# Inspect per-process CPU and memory in top, using PIDs from pm2 (requires jq)
top -p $(pm2 jlist | jq -r '.[] | select(.name=="backend-api") | .pid' | tr '\n' ',' | sed 's/,$//')
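
The jq pipelines in this guide all parse pm2 jlist output. Its per-process objects (simplified here, with made-up values) nest CPU and memory under monit and the restart counter under pm2_env:

```javascript
// Simplified shape of `pm2 jlist` entries; sample values are fabricated
const processes = [
  { name: 'backend-api', pid: 1234, monit: { cpu: 12.5, memory: 157286400 }, pm2_env: { status: 'online', restart_time: 0 } },
  { name: 'backend-queue', pid: 1240, monit: { cpu: 3.0, memory: 104857600 }, pm2_env: { status: 'online', restart_time: 2 } },
]

// Flatten into the fields you actually chart
const summary = processes.map((p) => ({
  name: p.name,
  cpu: p.monit.cpu,
  memoryMB: Math.round(p.monit.memory / 1048576),
  restarts: p.pm2_env.restart_time,
}))

console.log(summary)
```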

C. Restart Count

pm2 status
# Look at 'restart' column

# See restart details
pm2 show backend-api | grep -i restart

What to look for:

  - Restarts outside the cron_restart schedule: crashes, not maintenance
  - A growing "unstable restarts" count: crash loop

Action thresholds:

  - Any unexpected restart: check pm2 logs --err immediately
  - Several restarts within a few minutes: treat it as an incident

D. Event Loop Lag (Advanced)

pm2 install pm2-server-monit   # host-level metrics module
pm2 show backend-api
# Look for "Event Loop Latency" under Code metrics (requires @pm2/io in the app)

What to look for:

  - Latency consistently above ~100ms: requests are queuing behind blocked work

Common causes:

  - Synchronous file I/O or crypto on the request path
  - JSON.parse/JSON.stringify of very large payloads
  - Tight loops over large datasets

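To build intuition for what that latency number means, here is a plain-Node sketch (no PM2 modules involved): schedule a timer, block the loop, and observe how late the timer fires.

```javascript
// Event-loop lag = how much later a callback runs than scheduled
const interval = 50
const start = Date.now()
let lag = 0

setTimeout(() => {
  lag = Date.now() - start - interval // excess delay beyond the 50ms we asked for
  console.log(`event loop lag: ${lag}ms`)
}, interval)

// Simulate a blocking operation (sync I/O, heavy computation)
const blockUntil = Date.now() + 200
while (Date.now() < blockUntil) {} // nothing else can run during this loop
```

The timer was due at 50ms but cannot fire until the busy-wait ends at ~200ms, so the reported lag is roughly 150ms.
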
3. Identifying Specific Bottlenecks

Memory Leak Detection

# Monitor memory over 30 minutes
pm2 show backend-api | grep "heap size" 
# Wait 5 minutes
pm2 show backend-api | grep "heap size"
# Repeat...

If heap size continuously grows:

# Take heap snapshot for analysis
node --inspect bin/server.js
# Or use clinic.js
npm install -g clinic
pm2 stop backend-api
clinic doctor -- node ace serve

Common memory leak sources in your stack:

  - Event listeners or timers that are never removed
  - Unbounded in-memory caches or arrays
  - Large Shopify sync payloads held across requests
  - BullMQ job data retained after completion

CPU Bottleneck Detection

# If CPU is consistently high, profile with V8's built-in profiler
node --prof bin/server.js
# Generate load, stop the process, then post-process the isolate log
node --prof-process isolate-*.log > processed.txt

Common CPU bottlenecks in your stack:

  - Serializing/deserializing large JSON payloads (e.g., Shopify sync batches)
  - Inefficient loops or algorithms on hot request paths
  - Synchronous compression, hashing, or crypto

Queue Bottleneck Detection

pm2 logs backend-queue --lines 100 | grep "processing"
pm2 show backend-queue

Your queue worker config:

// From ecosystem.config.cjs
instances: 'max',  // Multiple workers
exec_mode: 'cluster',
cron_restart: '5 */4 * * *',  // Restart every 4 hours

Check for:

  - A backlog that grows faster than workers drain it
  - Jobs stuck in the active state (worker hung)
  - Rising failed/retry counts
  - Jobs interrupted around the 4-hour cron_restart window

Monitor specific queues:

# If you expose a queue metrics endpoint (hypothetical route)
curl http://localhost:3333/metrics/queues

Database Bottleneck Detection

# Check if high memory/CPU correlates with DB activity
pm2 logs backend-api --lines 200 | grep -i "query\|database"

# On MySQL server
mysql -u root -p -e "SHOW PROCESSLIST;"
mysql -u root -p -e "SELECT * FROM information_schema.INNODB_TRX;"

Signs of DB bottleneck:

  - SHOW PROCESSLIST full of long-running or locked queries
  - Long-lived transactions in INNODB_TRX
  - API latency climbing while app CPU stays low
  - Query timeout errors in the application logs

Your DB config has replicas:

// From config/database.ts
replicas: {
  read: { connection: [...] },
  write: { connection: {...} }
}

Check replica usage:

# Enable query logging temporarily
# Set DB_DEBUG=true in .env
pm2 restart backend-api --update-env
pm2 logs backend-api | grep "SELECT"  # Should show replica host

4. PM2 Plus (Hosted Monitoring)

Setup (Free tier: 4 servers, 20 apps)

pm2 plus
# Follow prompts to create account
# Get public/private keys

pm2 link <secret_key> <public_key>

What you get:

  1. Real-time metrics (1-second resolution)
  2. Historical graphs (CPU, memory, HTTP latency)
  3. Exception tracking with stack traces
  4. Custom metrics integration
  5. Alerting via email/Slack/webhook
  6. Deployment tracking
  7. Log streaming

Key PM2 Plus features for bottleneck detection:

A. HTTP Latency Distribution

Per-route p95/p99 latencies expose slow endpoints that averages hide.

B. Error Rate

Spikes that line up with deploys or restarts point to regressions rather than load.

C. Custom Metrics (add to your code):

// In your service files
import pmx from '@pm2/io'

const dbQueryTime = pmx.metric({
  name: 'Database Query Time',
  unit: 'ms'
})

// Before query
const start = Date.now()
await db.query()
dbQueryTime.set(Date.now() - start)
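
To keep the measurement logic out of every call site, you can wrap it in a small helper; timed below is a hypothetical utility, not part of @pm2/io:

```javascript
// Hypothetical timing wrapper: run any async function, return result + elapsed ms
async function timed(fn) {
  const start = process.hrtime.bigint()
  const result = await fn()
  const ms = Number(process.hrtime.bigint() - start) / 1e6
  return { result, ms }
}

// Usage with a stand-in for db.query()
timed(() => new Promise((resolve) => setTimeout(() => resolve('rows'), 20)))
  .then(({ result, ms }) => console.log(result, `${ms.toFixed(1)}ms`))
```

The same wrapper can feed dbQueryTime.set(ms) or a log line, so switching metric sinks later is a one-line change.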

5. Practical Bottleneck Detection Workflow

Step 1: Establish Baseline

# During low traffic
pm2 status
pm2 show backend-api
pm2 show backend-queue
# Note: CPU ~5-10%, Memory ~100-150MB, 0 restarts

Step 2: Monitor During Peak Load

# During high traffic (or simulate with load test)
watch -n 1 'pm2 status'
# Compare with baseline

Step 3: Identify Anomaly

If CPU spikes:

pm2 logs backend-api --lines 50  # What endpoints are being hit?
pm2 show backend-api | grep cpu
# Profile if sustained high CPU

If memory spikes:

pm2 show backend-api | grep heap
pm2 logs backend-api | grep -i "error\|timeout"
# Check for memory-intensive operations

If restarts increase:

pm2 logs backend-api --err --lines 100  # Check error logs
pm2 show backend-api | grep "restart\|unstable"
# Check for uncaught exceptions

Step 4: Correlate with Application Logs

# Check AdonisJS logs
pm2 logs backend-api --lines 200 | grep -E "ERROR|WARN|slow"

# Check queue job logs
pm2 logs backend-queue --lines 200 | grep -E "failed|retry|timeout"

# Check specific log files (production uses file transport)
tail -f logs/*.log
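
Before committing to a grep pattern, it is easy to sanity-check the same filter in Node; the log lines below are fabricated:

```javascript
// The grep -E "ERROR|WARN|slow" filter, expressed as a regex over sample lines
const lines = [
  'INFO request completed in 12ms',
  'ERROR connection refused to mysql',
  'WARN slow query detected (850ms)',
  'INFO healthcheck ok',
]

const suspicious = lines.filter((line) => /ERROR|WARN|slow/.test(line))
console.log(suspicious) // only the ERROR and WARN lines survive
```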

Step 5: Deep Dive

For memory issues:

# Take a heap snapshot on demand (requires starting Node with --heapsnapshot-signal=SIGUSR2)
kill -USR2 $(pm2 jlist | jq -r '.[] | select(.name=="backend-api") | .pid' | head -1)
# Analyze the generated .heapsnapshot with Chrome DevTools (Memory tab)

For CPU issues:

# CPU profiling
pm2 stop backend-api
node --cpu-prof ace serve
# Generate load, then stop with Ctrl+C
# Open the generated CPU.*.cpuprofile in Chrome DevTools

6. Metrics to Track Over Time

Create a simple monitoring script:

#!/bin/bash
# save as: monitor.sh

while true; do
  echo "=== $(date) ===" >> metrics.log
  pm2 jlist | jq -c '.[] | select(.name=="backend-api" or .name=="backend-queue") | {name, cpu: .monit.cpu, memory: .monit.memory, restarts: .pm2_env.restart_time}' >> metrics.log
  sleep 60
done

# Run it in the background
chmod +x monitor.sh
nohup ./monitor.sh &

# Analyze later
grep -o '"cpu":[0-9.]*' metrics.log | cut -d: -f2 | awk '{sum+=$1; count++} END {print "Avg CPU:", sum/count}'
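
The awk average can be replicated (and checked) in Node; the metrics.log lines below are fabricated samples in the same "cpu":value shape:

```javascript
// Fabricated metrics.log content in the "cpu":value shape the script records
const log = [
  '"cpu":12.5,"memory":157286400',
  '"cpu":40,"memory":209715200',
  '"cpu":7.5,"memory":152043520',
].join('\n')

// Extract every cpu value and average it, as the grep | awk pipeline does
const cpus = [...log.matchAll(/"cpu":([0-9.]+)/g)].map((m) => Number(m[1]))
const avg = cpus.reduce((a, b) => a + b, 0) / cpus.length

console.log('Avg CPU:', avg)
```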

7. Quick Bottleneck Diagnostic Cheatsheet

| Symptom | Likely Bottleneck | Check |
| --- | --- | --- |
| Memory approaches 300MB repeatedly | Memory leak or insufficient limit | pm2 show backend-api → heap usage trend |
| CPU 100% on all instances | CPU-bound operations | Profile code, check slow algorithms |
| High restarts with errors | Application crashes | pm2 logs --err, uncaught exceptions |
| Slow API responses | DB queries, external APIs | Enable query logging, check Shopify sync |
| Queue growing, not processing | Worker stuck or too slow | pm2 logs backend-queue, BullMQ dashboard |
| Memory spikes during specific operations | Large object allocations | Check Shopify sync, AWS IVS streams |
| Uneven CPU across instances | Poor load balancing | Check if sticky sessions needed |
| Event loop lag > 100ms | Blocking operations | Check for sync file I/O, heavy computation |

Summary: Top 5 Metrics for Your Setup

  1. Memory usage → watch the 300MB limit (most critical given your config)
  2. Restart count → distinguish cron restarts from crashes
  3. CPU percentage → identify processing bottlenecks
  4. Queue worker health → ensure background jobs process smoothly
  5. Error log patterns → get early warning of issues

Start with pm2 monit for real-time view, then use pm2 show for detailed analysis when you spot anomalies.