Agent Monitoring

Comprehensive monitoring of EPMware Agents helps keep metadata deployments reliable and surfaces issues before they affect operations. This guide covers monitoring strategies, health checks, and alerting configurations.

Monitoring Overview

EPMware Agent monitoring encompasses:

- Agent availability and uptime
- Connection status to EPMware server
- Deployment success rates
- Resource utilization
- Response times and performance metrics

Agent Status Monitoring

From EPMware Application

Monitor agent status directly from the EPMware web interface:

1. Navigate to Infrastructure → Servers
2. View the Agent Status column for each server
3. Right-click any server and select Test Connection

Status indicators:

- 🟢 Online - Agent is connected and responding
- 🟡 Warning - Agent responding slowly or intermittently
- 🔴 Offline - Agent is not responding
- ⚫ Unknown - No recent status update

Command Line Health Check

Linux:

#!/bin/bash
# Agent health check script
# pgrep -f matches against the full command line; keep the first match only
PID=$(pgrep -f epmware-agent | head -n 1)
if [ -z "$PID" ]; then
    echo "ERROR: Agent is not running"
    exit 1
else
    echo "OK: Agent is running (PID: $PID)"
    exit 0
fi

Windows (PowerShell):

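# Check if agent is running
# Get-Process exposes CommandLine only in newer PowerShell 7 releases;
# Win32_Process provides it in Windows PowerShell 5.1 as well
$agent = Get-CimInstance Win32_Process -Filter "Name = 'java.exe'" |
    Where-Object { $_.CommandLine -like "*epmware-agent.jar*" }
if ($agent) {
    Write-Host "OK: Agent is running (PID: $($agent.ProcessId))"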
    exit 0
} else {
    Write-Host "ERROR: Agent is not running"
    exit 1
}

Performance Monitoring

Resource Utilization

Monitor CPU and memory usage of the agent process:

Linux:

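# Real-time monitoring (-d, joins multiple PIDs with commas, as top -p expects)
top -p "$(pgrep -d, -f epmware-agent)"

# Memory usage (percentage of physical memory)
ps aux | grep epmware-agent | grep -v grep | awk '{print "Memory: "$4"%"}'

# CPU usage (ps reports the average over the process lifetime, not the current load)
ps aux | grep epmware-agent | grep -v grep | awk '{print "CPU: "$3"%"}'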

Windows (Task Manager):

1. Open Task Manager
2. Find the java.exe process running epmware-agent.jar
3. Monitor CPU and Memory columns

Response Time Monitoring

Track agent response times by analyzing logs:

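# List the completion time of each deployment as epoch seconds.
# Assumes every log line starts with a date and a time (fields 1 and 2)
# that GNU date can parse. Pairing each completion with its start event
# to get a duration depends on the agent's log format.
grep "Deployment.*completed" agent.log | \
  awk '{print $1, $2}' | \
  while read -r completed_at; do
    date -d "$completed_at" +%s
  done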
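
To turn log timestamps into an average deployment duration, each completion has to be paired with its start event. The sketch below shows one way to do that. The log format it parses is an assumption made for illustration (paired `started` and `completed` lines carrying a deployment ID), not the agent's documented format, and the function name `avg_deployment_seconds` is hypothetical. It relies on GNU `date`.

```shell
# Assumed (hypothetical) log format:
#   2024-05-01 10:00:00 Deployment 42 started
#   2024-05-01 10:00:07 Deployment 42 completed

# Prints the average duration in seconds of all deployments that have
# both a start and a completion line, or "n/a" if there are none.
avg_deployment_seconds() {
    # $1: path to the log file
    grep -E "Deployment [^ ]+ (started|completed)" "$1" |
    while read -r day time word id event; do
        # GNU date converts the timestamp to epoch seconds
        echo "$(date -d "$day $time" +%s) $id $event"
    done |
    awk '
        $3 == "started" { start[$2] = $1 }
        $3 == "completed" && ($2 in start) {
            total += $1 - start[$2]; n++
            delete start[$2]
        }
        END { if (n) printf "%.1f\n", total / n; else print "n/a" }
    '
}
```

Calling `avg_deployment_seconds agent.log` prints a single number that can be compared against the response-time target. Adjust the `grep` pattern and field positions to match the real log lines.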

Automated Monitoring Scripts

Continuous Monitoring Script

Create a monitoring script that runs continuously:

monitor-agent.sh (Linux):

#!/bin/bash

AGENT_HOME="/home/[username]"
LOG_FILE="$AGENT_HOME/logs/agent-monitor.log"
ALERT_EMAIL="admin@company.com"

while true; do
    # Check if agent is running
    PID=$(pgrep -f epmware-agent | head -n 1)

    if [ -z "$PID" ]; then
        echo "$(date): Agent DOWN - Attempting restart" >> "$LOG_FILE"
        # Start the agent from its home directory; nothing is started if cd fails
        (cd "$AGENT_HOME" && ./ew_target_service.sh) &

        # Send alert
        echo "EPMware Agent down on $(hostname)" | mail -s "Agent Alert" "$ALERT_EMAIL"
    else
        echo "$(date): Agent UP - PID: $PID" >> "$LOG_FILE"
    fi

    # Check every 5 minutes
    sleep 300
done

Windows Scheduled Task Monitoring

Create a PowerShell script and schedule it to run every 5 minutes:

Monitor-Agent.ps1:

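# Win32_Process exposes the command line in Windows PowerShell 5.1,
# where Get-Process objects have no CommandLine property
$agentProcess = Get-CimInstance Win32_Process -Filter "Name = 'java.exe'" |
    Where-Object { $_.CommandLine -like "*epmware-agent.jar*" }

if (-not $agentProcess) {
    # Agent is not running - restart it
    # The "EPMware Agent" event source must already be registered (New-EventLog)
    Write-EventLog -LogName Application -Source "EPMware Agent" `
        -EventId 1001 -EntryType Error `
        -Message "Agent not running - attempting restart"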

    # Start the scheduled task
    Start-ScheduledTask -TaskName "EPMWARE TARGET AGENT SERVICE"

    # Send email alert (requires configured SMTP)
    Send-MailMessage -To "admin@company.com" `
        -From "epmware@company.com" `
        -Subject "EPMware Agent Down" `
        -Body "Agent was down on $env:COMPUTERNAME and has been restarted" `
        -SmtpServer "smtp.company.com"
}

Key Metrics to Monitor

Availability Metrics

- Uptime percentage - Target: >99.5%
- Mean time between failures (MTBF)
- Mean time to recovery (MTTR)
- Number of restarts required
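
Uptime percentage can be approximated from the log that monitor-agent.sh writes, since every check appends either an "Agent UP" or an "Agent DOWN" line. The following is a minimal sketch; the function name `uptime_pct` is hypothetical, and the result is only as precise as the check interval (5 minutes in the script above).

```shell
# Prints the share of "Agent UP" samples in a monitor log as a percentage,
# or "n/a" if the log contains no samples.
uptime_pct() {
    # $1: log containing "Agent UP" / "Agent DOWN" lines
    awk '
        /Agent UP/   { up++ }
        /Agent DOWN/ { down++ }
        END {
            total = up + down
            if (total) printf "%.2f\n", 100 * up / total
            else print "n/a"
        }
    ' "$1"
}
```

For example, `uptime_pct "$AGENT_HOME/logs/agent-monitor.log"` prints a value to compare against the 99.5% target. The count of "Agent DOWN" lines in the same log gives the number of restarts attempted.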

Performance Metrics

- Average response time - Target: <2 seconds
- Deployment success rate - Target: >98%
- Queue processing time
- Memory usage - Alert if >1GB
- CPU usage - Alert if >80% sustained
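
The memory threshold can be checked against the resident set size that `ps` reports in kilobytes. This is a rough sketch with a hypothetical function name, `rss_exceeds`. Note that resident memory of a Java process includes more than the heap, so it is an approximation of what the agent actually uses.

```shell
# Returns 0 (true) when the resident memory of a process exceeds a limit.
rss_exceeds() {
    # $1: process ID, $2: limit in kilobytes (1 GB = 1048576 KB)
    rss_kb=$(ps -o rss= -p "$1" | tr -d ' ')
    [ -n "$rss_kb" ] && [ "$rss_kb" -gt "$2" ]
}

# Example usage:
# PID=$(pgrep -f epmware-agent | head -n 1)
# if [ -n "$PID" ] && rss_exceeds "$PID" 1048576; then
#     echo "Agent memory above 1 GB"
# fi
```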

Business Metrics

- Deployments per day/hour
- Failed deployments
- Deployment duration trends
- Peak usage times
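
Deployments per hour, and from them the peak usage times, can be derived by grouping completed deployments by the hour in their timestamp. The sketch below assumes each log line starts with a date and a time in `YYYY-MM-DD HH:MM:SS` form, which may not match the real agent log; the function name `deployments_per_hour` is hypothetical.

```shell
# Prints one line per hour: "<date> <hour>:00 <completed deployments>".
deployments_per_hour() {
    # $1: path to the log file
    grep "Deployment.*completed" "$1" |
    awk '{ print $1, substr($2, 1, 2) ":00" }' |
    sort | uniq -c |
    awk '{ print $2, $3, $1 }'
}
```

Piping the output through `sort -k3,3nr | head` lists the busiest hours first.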

Alerting Configuration

Log-based Alerts

Monitor specific patterns in log files:

#!/bin/bash
# Alert on errors
tail -F agent.log | while IFS= read -r line; do
    if echo "$line" | grep -qE "ERROR|FATAL"; then
        echo "$line" | mail -s "EPMware Agent Error" admin@company.com
    fi
done

Threshold Alerts

Set up alerts for resource thresholds:

# CPU usage alert
PID=$(pgrep -f epmware-agent | head -n 1)
CPU_USAGE=$(ps -o %cpu= -p "$PID" 2>/dev/null)
# Skip the comparison when the agent is not running and no value was read
if [ -n "$CPU_USAGE" ] && (( $(echo "$CPU_USAGE > 80" | bc -l) )); then
    echo "High CPU usage: $CPU_USAGE%" | mail -s "Agent CPU Alert" admin@company.com
fi
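
A single reading above the threshold is often just a spike. To alert only on sustained load, one option is to collect several samples and alert when all of them exceed the limit. The helper below is a sketch with a hypothetical name, `all_above`. It takes the samples as arguments, so they can come from whichever tool reports interval-based CPU usage (the `%CPU` column of `ps` is averaged over the process lifetime and reacts slowly).

```shell
# Returns 0 (true) only if at least one sample is given and every sample
# exceeds the threshold.
all_above() {
    # $1: threshold, remaining arguments: samples (may be fractional)
    threshold=$1
    shift
    [ "$#" -gt 0 ] || return 1
    for sample in "$@"; do
        # awk performs the floating-point comparison
        awk -v s="$sample" -v t="$threshold" 'BEGIN { exit !(s > t) }' || return 1
    done
    return 0
}

# Example usage with three samples collected a minute apart:
# if all_above 80 "$sample1" "$sample2" "$sample3"; then
#     echo "Sustained high CPU" | mail -s "Agent CPU Alert" admin@company.com
# fi
```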

Integration with Monitoring Tools

Nagios Plugin

Create a custom Nagios plugin for agent monitoring:

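#!/bin/bash
# check_epmware_agent.sh
# Nagios exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN

# Check if agent is running
PID=$(pgrep -f epmware-agent | head -n 1)

if [ -z "$PID" ]; then
    echo "CRITICAL: EPMware Agent is not running"
    exit 2
fi

# Check last poll time (should be within last 60 seconds).
# Uses the poll log's modification time as a proxy, which assumes the agent
# writes to this file on every poll. stat -c is the GNU coreutils syntax.
POLL_LOG="$HOME/logs/agent-poll.log"
if [ -f "$POLL_LOG" ]; then
    AGE=$(( $(date +%s) - $(stat -c %Y "$POLL_LOG") ))
    if [ "$AGE" -gt 60 ]; then
        echo "WARNING: EPMware Agent is running (PID: $PID) but last poll was ${AGE}s ago"
        exit 1
    fi
fi

echo "OK: EPMware Agent is running (PID: $PID)"
exit 0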

Zabbix Monitoring

Configure Zabbix items for agent monitoring:

- Process monitoring:
  - Key: proc.num[java,,,epmware-agent.jar]
  - Trigger: Process count < 1
- Log monitoring:
  - Key: log[/home/username/logs/agent.log,"ERROR"]
  - Trigger: Error count > 0
- Port monitoring:
  - Monitor agent communication port
  - Alert on connection failures

Splunk Integration

Forward agent logs to Splunk for advanced analytics:

inputs.conf:

[monitor:///home/*/logs/agent*.log]
sourcetype = epmware_agent
index = epmware

Create Splunk alerts for:

- Error patterns in logs
- Deployment failures
- Connection issues
- Performance degradation

Dashboard Creation

Monitoring Dashboard Components

Create a comprehensive monitoring dashboard including:

- Agent Status Panel
  - Current status (Up/Down)
  - Uptime percentage
  - Last successful poll time
- Performance Metrics
  - Response time graph
  - CPU/Memory usage trends
  - Queue size over time
- Deployment Statistics
  - Success/failure rates
  - Average deployment duration
  - Deployments by application
- Alert Summary
  - Recent errors
  - Critical alerts
  - Warning notifications

Troubleshooting Monitoring Issues

Agent Shows Offline but Is Running

Check network connectivity:

ping epmware-server.com
telnet epmware-server.com 443

Verify agent configuration:
```
grep "ew.portal" agent.properties
```
Check firewall rules:
```
# -n prints numeric ports; without it, port 443 is listed as "https"
sudo iptables -L -n | grep 443
```
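
On hosts where `telnet` is not installed, the connectivity check can be scripted with bash's built-in `/dev/tcp` redirection. This is a minimal sketch; the function name `port_open` is hypothetical, and it requires `bash` and the coreutils `timeout` command.

```shell
# Returns 0 if a TCP connection to host:port succeeds within the timeout.
port_open() {
    # $1: host, $2: port, $3: timeout in seconds (default 5)
    # Inside the bash -c script, $0 is the host and $1 is the port.
    timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example usage:
# if port_open epmware-server.com 443; then
#     echo "EPMware server reachable on 443"
# else
#     echo "Cannot reach EPMware server on 443"
# fi
```

A success here only shows that the TCP port is reachable. It does not verify TLS, proxy settings, or the agent's credentials.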

False Alerts

- Adjust polling intervals in monitoring scripts
- Increase timeout thresholds
- Implement alert suppression during maintenance
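
Alert suppression during maintenance can be as simple as a flag file that the alerting code checks before sending mail. The sketch below illustrates the idea. The flag path and the function names `alerts_suppressed` and `send_alert` are hypothetical, and the recipient address is the placeholder used elsewhere in this guide.

```shell
# Hypothetical flag file: create it before maintenance, remove it afterwards.
MAINTENANCE_FLAG="${MAINTENANCE_FLAG:-/tmp/epmware-agent.maintenance}"

# Returns 0 (true) while the maintenance flag file exists.
alerts_suppressed() {
    [ -f "$MAINTENANCE_FLAG" ]
}

# Sends an alert by mail unless alerts are currently suppressed.
send_alert() {
    # $1: subject, $2: message body
    if alerts_suppressed; then
        echo "suppressed: $1"
        return 0
    fi
    echo "$2" | mail -s "$1" admin@company.com
}
```

Running `touch /tmp/epmware-agent.maintenance` before a planned restart silences alerts, and `rm` on the same file re-enables them.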

Missing Metrics

- Verify log file permissions
- Check disk space for log storage
- Ensure monitoring scripts have proper execution rights
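
The three checks above can be combined into one preflight function that a monitoring script runs before it starts. This is a sketch with a hypothetical function name, `check_monitoring_prereqs`; the minimum free space is passed in by the caller because a suitable value depends on log volume.

```shell
# Returns 0 if the log is readable, the script is executable, and the
# log directory has at least the requested free space. Prints one line
# per failed check.
check_monitoring_prereqs() {
    # $1: log file, $2: monitoring script, $3: minimum free space in KB
    status=0
    [ -r "$1" ] || { echo "Log file not readable: $1"; status=1; }
    [ -x "$2" ] || { echo "Script not executable: $2"; status=1; }
    # df -Pk prints available space in 1024-byte blocks in column 4
    free_kb=$(df -Pk "$(dirname "$1")" | awk 'NR == 2 { print $4 }')
    if [ "${free_kb:-0}" -lt "$3" ]; then
        echo "Low disk space for logs: ${free_kb:-0} KB free"
        status=1
    fi
    return "$status"
}
```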

Best Practices

- Establish Baselines
  - Document normal performance metrics
  - Set realistic thresholds based on baselines
  - Review and adjust thresholds quarterly
- Implement Redundancy
  - Use multiple monitoring methods
  - Configure backup alerting channels
  - Test failover procedures regularly
- Document Procedures
  - Create runbooks for common issues
  - Document escalation procedures
  - Maintain contact lists for alerts
- Regular Testing
  - Test monitoring scripts monthly
  - Verify alert delivery
  - Conduct failure scenario drills
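
One common way to derive a threshold from a baseline is to take the mean of recorded samples plus a number of standard deviations. The sketch below uses that rule as an illustration; it is not an EPMware recommendation, and the function name `baseline_threshold` is hypothetical. It reads one numeric sample per line from standard input.

```shell
# Prints mean + k * (population standard deviation) of the samples on
# standard input, or "n/a" if there are none.
baseline_threshold() {
    # $1: number of standard deviations above the mean (default 2)
    awk -v k="${1:-2}" '
        NF { n++; sum += $1; sumsq += $1 * $1 }
        END {
            if (!n) { print "n/a"; exit }
            mean = sum / n
            var = sumsq / n - mean * mean
            if (var < 0) var = 0   # guard against rounding error
            printf "%.2f\n", mean + k * sqrt(var)
        }
    '
}
```

For example, feeding it a file of daily response-time measurements with `baseline_threshold 2 < samples.txt` yields a candidate alert threshold, which should still be reviewed against the targets listed under Key Metrics to Monitor.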