Dung (Donny) Nguyen

Senior Software Engineer

Detailed PM2 Monitoring Guide for Bottleneck Detection

1. Basic PM2 Monitoring Commands

Real-Time Process Monitoring

pm2 monit

This opens an interactive terminal dashboard with four panes:

  - Process list with live CPU and memory per process
  - Streaming logs for the selected process
  - Custom metrics (populated if the app uses @pm2/io)
  - Process metadata (uptime, restarts, script path)

Process Status Overview

pm2 status
# or
pm2 ls

Shows a table with:

  - id, name, and mode (fork or cluster)
  - status (online, stopped, errored)
  - ↺ (restart count) and uptime
  - cpu (%) and mem columns

Detailed Process Information

pm2 show backend-api
pm2 show backend-queue

Provides:

  - Script path, exec cwd, and log file locations
  - Restart count, unstable restarts, and uptime
  - Exec mode, Node.js version, and environment
  - Code metrics (heap size, event loop latency) when @pm2/io is enabled

2. Critical Metrics to Monitor

Assuming your ecosystem.config.cjs sets max_memory_restart: '300M', these are the key metrics:

A. Memory Usage ⚠️ CRITICAL FOR YOUR SETUP

pm2 show backend-api | grep memory
pm2 show backend-queue | grep memory

What to look for:

  - A steady climb toward 300MB: likely a memory leak
  - A sawtooth pattern (climb, then drop at restart): max_memory_restart is papering over a leak
  - Stable usage well under the limit: healthy

Action thresholds:

  - Sustained usage above ~240MB (80% of the limit): start investigating
  - Frequent restarts triggered by max_memory_restart: fix the leak or raise the limit deliberately

Check memory trend:

# Watch memory over time (refresh every 2 seconds)
watch -n 2 'pm2 status'
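
The trend matters more than any single reading: a consistently positive growth per sample is the leak signature. A minimal sketch with fabricated readings (not real pm2 output):

```javascript
// Fabricated memory samples in MB, one per polling interval
const samples = [150, 165, 181, 198]

// Average growth between consecutive samples
const deltas = samples.slice(1).map((value, i) => value - samples[i])
const avgGrowth = deltas.reduce((a, b) => a + b, 0) / deltas.length

console.log(`avg growth: ${avgGrowth.toFixed(1)} MB/sample`)
// A steadily positive value across many samples suggests a leak;
// a flat or oscillating value is normal GC behavior
```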

B. CPU Usage

pm2 status
# Look at 'cpu' column

What to look for:

  - Short spikes during traffic bursts: normal
  - Sustained high CPU under flat traffic: CPU-bound work or a busy loop
  - One instance pegged while the others idle: uneven load distribution

Action thresholds:

  - Sustained >70-80% per instance: profile the hot paths
  - 100% across all instances: you are CPU-bound; optimize the code or add capacity

Your cluster mode setup:

// From ecosystem.config.cjs
instances: 'max',  // Uses all CPU cores
exec_mode: 'cluster'

Check CPU distribution:

# Inspect per-process CPU and memory in top, using PIDs from pm2 (requires jq)
top -p $(pm2 jlist | jq -r '.[] | select(.name=="backend-api") | .pid' | tr '\n' ',' | sed 's/,$//')
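
The jq pipelines in this guide all parse pm2 jlist output. Its per-process objects (simplified here, with made-up values) nest CPU and memory under monit and the restart counter under pm2_env:

```javascript
// Simplified shape of `pm2 jlist` entries; sample values are fabricated
const processes = [
  { name: 'backend-api', pid: 1234, monit: { cpu: 12.5, memory: 157286400 }, pm2_env: { status: 'online', restart_time: 0 } },
  { name: 'backend-queue', pid: 1240, monit: { cpu: 3.0, memory: 104857600 }, pm2_env: { status: 'online', restart_time: 2 } },
]

// Flatten into the fields you actually chart
const summary = processes.map((p) => ({
  name: p.name,
  cpu: p.monit.cpu,
  memoryMB: Math.round(p.monit.memory / 1048576),
  restarts: p.pm2_env.restart_time,
}))

console.log(summary)
```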

C. Restart Count

pm2 status
# Look at 'restart' column

# See restart details
pm2 show backend-api | grep -i restart

What to look for:

  - Restarts outside the cron_restart schedule: crashes, not maintenance
  - A growing "unstable restarts" count: crash loop

Action thresholds:

  - Any unexpected restart: check pm2 logs --err immediately
  - Several restarts within a few minutes: treat it as an incident

D. Event Loop Lag (Advanced)

pm2 install pm2-server-monit   # host-level metrics module
pm2 show backend-api
# Look for "Event Loop Latency" under Code metrics (requires @pm2/io in the app)

What to look for:

  - Latency consistently above ~100ms: requests are queuing behind blocked work

Common causes:

  - Synchronous file I/O or crypto on the request path
  - JSON.parse/JSON.stringify of very large payloads
  - Tight loops over large datasets

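To build intuition for what that latency number means, here is a plain-Node sketch (no PM2 modules involved): schedule a timer, block the loop, and observe how late the timer fires.

```javascript
// Event-loop lag = how much later a callback runs than scheduled
const interval = 50
const start = Date.now()
let lag = 0

setTimeout(() => {
  lag = Date.now() - start - interval // excess delay beyond the 50ms we asked for
  console.log(`event loop lag: ${lag}ms`)
}, interval)

// Simulate a blocking operation (sync I/O, heavy computation)
const blockUntil = Date.now() + 200
while (Date.now() < blockUntil) {} // nothing else can run during this loop
```

The timer was due at 50ms but cannot fire until the busy-wait ends at ~200ms, so the reported lag is roughly 150ms.
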
3. Identifying Specific Bottlenecks

Memory Leak Detection

# Monitor memory over 30 minutes
pm2 show backend-api | grep "heap size" 
# Wait 5 minutes
pm2 show backend-api | grep "heap size"
# Repeat...

If heap size continuously grows:

# Take heap snapshot for analysis
node --inspect bin/server.js
# Or use clinic.js
npm install -g clinic
pm2 stop backend-api
clinic doctor -- node ace serve

Common memory leak sources in your stack:

  - Event listeners or timers that are never removed
  - Unbounded in-memory caches or arrays
  - Large Shopify sync payloads held across requests
  - BullMQ job data retained after completion

CPU Bottleneck Detection

# If CPU is consistently high, profile with V8's built-in profiler
node --prof bin/server.js
# Generate load, stop the process, then post-process the isolate log
node --prof-process isolate-*.log > processed.txt

Common CPU bottlenecks in your stack:

  - Serializing/deserializing large JSON payloads (e.g., Shopify sync batches)
  - Inefficient loops or algorithms on hot request paths
  - Synchronous compression, hashing, or crypto

Queue Bottleneck Detection

pm2 logs backend-queue --lines 100 | grep "processing"
pm2 show backend-queue

Your queue worker config:

// From ecosystem.config.cjs
instances: 'max',  // Multiple workers
exec_mode: 'cluster',
cron_restart: '5 */4 * * *',  // Restart every 4 hours

Check for:

  - A backlog that grows faster than workers drain it
  - Jobs stuck in the active state (worker hung)
  - Rising failed/retry counts
  - Jobs interrupted around the 4-hour cron_restart window

Monitor specific queues:

# If you expose a queue metrics endpoint (hypothetical route)
curl http://localhost:3333/metrics/queues

Database Bottleneck Detection

# Check if high memory/CPU correlates with DB activity
pm2 logs backend-api --lines 200 | grep -i "query\|database"

# On MySQL server
mysql -u root -p -e "SHOW PROCESSLIST;"
mysql -u root -p -e "SELECT * FROM information_schema.INNODB_TRX;"

Signs of DB bottleneck:

  - SHOW PROCESSLIST full of long-running or locked queries
  - Long-lived transactions in INNODB_TRX
  - API latency climbing while app CPU stays low
  - Query timeout errors in the application logs

Your DB config has replicas:

// From config/database.ts
replicas: {
  read: { connection: [...] },
  write: { connection: {...} }
}

Check replica usage:

# Enable query logging temporarily
# Set DB_DEBUG=true in .env
pm2 restart backend-api --update-env
pm2 logs backend-api | grep "SELECT"  # Should show replica host

4. PM2 Plus (Hosted Monitoring)

Setup (Free tier: 4 servers, 20 apps)

pm2 plus
# Follow prompts to create account
# Get public/private keys

pm2 link <secret_key> <public_key>

What you get:

  1. Real-time metrics (1-second resolution)
  2. Historical graphs (CPU, memory, HTTP latency)
  3. Exception tracking with stack traces
  4. Custom metrics integration
  5. Alerting via email/Slack/webhook
  6. Deployment tracking
  7. Log streaming

Key PM2 Plus features for bottleneck detection:

A. HTTP Latency Distribution

Per-route p95/p99 latencies expose slow endpoints that averages hide.

B. Error Rate

Spikes that line up with deploys or restarts point to regressions rather than load.

C. Custom Metrics (add to your code):

// In your service files
import pmx from '@pm2/io'

const dbQueryTime = pmx.metric({
  name: 'Database Query Time',
  unit: 'ms'
})

// Before query
const start = Date.now()
await db.query()
dbQueryTime.set(Date.now() - start)
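
To keep the measurement logic out of every call site, you can wrap it in a small helper; timed below is a hypothetical utility, not part of @pm2/io:

```javascript
// Hypothetical timing wrapper: run any async function, return result + elapsed ms
async function timed(fn) {
  const start = process.hrtime.bigint()
  const result = await fn()
  const ms = Number(process.hrtime.bigint() - start) / 1e6
  return { result, ms }
}

// Usage with a stand-in for db.query()
timed(() => new Promise((resolve) => setTimeout(() => resolve('rows'), 20)))
  .then(({ result, ms }) => console.log(result, `${ms.toFixed(1)}ms`))
```

The same wrapper can feed dbQueryTime.set(ms) or a log line, so switching metric sinks later is a one-line change.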

5. Practical Bottleneck Detection Workflow

Step 1: Establish Baseline

# During low traffic
pm2 status
pm2 show backend-api
pm2 show backend-queue
# Note: CPU ~5-10%, Memory ~100-150MB, 0 restarts

Step 2: Monitor During Peak Load

# During high traffic (or simulate with load test)
watch -n 1 'pm2 status'
# Compare with baseline

Step 3: Identify Anomaly

If CPU spikes:

pm2 logs backend-api --lines 50  # What endpoints are being hit?
pm2 show backend-api | grep cpu
# Profile if sustained high CPU

If memory spikes:

pm2 show backend-api | grep heap
pm2 logs backend-api | grep -i "error\|timeout"
# Check for memory-intensive operations

If restarts increase:

pm2 logs backend-api --err --lines 100  # Check error logs
pm2 show backend-api | grep "restart\|unstable"
# Check for uncaught exceptions

Step 4: Correlate with Application Logs

# Check AdonisJS logs
pm2 logs backend-api --lines 200 | grep -E "ERROR|WARN|slow"

# Check queue job logs
pm2 logs backend-queue --lines 200 | grep -E "failed|retry|timeout"

# Check specific log files (production uses file transport)
tail -f logs/*.log
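
Before committing to a grep pattern, it is easy to sanity-check the same filter in Node; the log lines below are fabricated:

```javascript
// The grep -E "ERROR|WARN|slow" filter, expressed as a regex over sample lines
const lines = [
  'INFO request completed in 12ms',
  'ERROR connection refused to mysql',
  'WARN slow query detected (850ms)',
  'INFO healthcheck ok',
]

const suspicious = lines.filter((line) => /ERROR|WARN|slow/.test(line))
console.log(suspicious) // only the ERROR and WARN lines survive
```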

Step 5: Deep Dive

For memory issues:

# Take a heap snapshot on demand (requires starting Node with --heapsnapshot-signal=SIGUSR2)
kill -USR2 $(pm2 jlist | jq -r '.[] | select(.name=="backend-api") | .pid' | head -1)
# Analyze the generated .heapsnapshot with Chrome DevTools (Memory tab)

For CPU issues:

# CPU profiling
pm2 stop backend-api
node --cpu-prof ace serve
# Generate load, then stop with Ctrl+C
# Open the generated CPU.*.cpuprofile in Chrome DevTools

6. Metrics to Track Over Time

Create a simple monitoring script:

#!/bin/bash
# save as: monitor.sh

while true; do
  echo "=== $(date) ===" >> metrics.log
  pm2 jlist | jq -c '.[] | select(.name=="backend-api" or .name=="backend-queue") | {name, cpu: .monit.cpu, memory: .monit.memory, restarts: .pm2_env.restart_time}' >> metrics.log
  sleep 60
done

# Run it in the background
chmod +x monitor.sh
nohup ./monitor.sh &

# Analyze later
grep -o '"cpu":[0-9.]*' metrics.log | cut -d: -f2 | awk '{sum+=$1; count++} END {print "Avg CPU:", sum/count}'
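
The awk average can be replicated (and checked) in Node; the metrics.log lines below are fabricated samples in the same "cpu":value shape:

```javascript
// Fabricated metrics.log content in the "cpu":value shape the script records
const log = [
  '"cpu":12.5,"memory":157286400',
  '"cpu":40,"memory":209715200',
  '"cpu":7.5,"memory":152043520',
].join('\n')

// Extract every cpu value and average it, as the grep | awk pipeline does
const cpus = [...log.matchAll(/"cpu":([0-9.]+)/g)].map((m) => Number(m[1]))
const avg = cpus.reduce((a, b) => a + b, 0) / cpus.length

console.log('Avg CPU:', avg)
```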

7. Quick Bottleneck Diagnostic Cheatsheet

| Symptom | Likely Bottleneck | Check |
| --- | --- | --- |
| Memory approaches 300MB repeatedly | Memory leak or insufficient limit | pm2 show backend-api → heap usage trend |
| CPU 100% on all instances | CPU-bound operations | Profile code, check slow algorithms |
| High restarts with errors | Application crashes | pm2 logs --err, uncaught exceptions |
| Slow API responses | DB queries, external APIs | Enable query logging, check Shopify sync |
| Queue growing, not processing | Worker stuck or too slow | pm2 logs backend-queue, BullMQ dashboard |
| Memory spikes during specific operations | Large object allocations | Check Shopify sync, AWS IVS streams |
| Uneven CPU across instances | Poor load balancing | Check if sticky sessions needed |
| Event loop lag > 100ms | Blocking operations | Check for sync file I/O, heavy computation |

Summary: Top 5 Metrics for Your Setup

  1. Memory usage → watch the 300MB limit (most critical given your config)
  2. Restart count → distinguish cron restarts from crashes
  3. CPU percentage → identify processing bottlenecks
  4. Queue worker health → ensure background jobs process smoothly
  5. Error log patterns → get early warning of issues

Start with pm2 monit for real-time view, then use pm2 show for detailed analysis when you spot anomalies.