Troubleshooting Common Issues with Cloudiff Monitor Agent
Cloudiff Monitor Agent helps collect metrics and logs from servers and applications. When it misbehaves, follow this structured troubleshooting guide to identify and fix the most common problems quickly.
1. Agent not running or crashing
- Check service status
- Linux: sudo systemctl status cloudiff-agent
- Windows: check Services → Cloudiff Monitor Agent.
- Restart the agent
- Linux: sudo systemctl restart cloudiff-agent
- Windows: restart the service from Services or run net stop cloudiff-agent && net start cloudiff-agent.
- Check logs (see section 3).
- Common fixes
- Ensure the agent binary has execute permissions.
- Verify the system meets minimum requirements (RAM, disk space).
2. Agent cannot connect to the collector / backend
- Verify network connectivity
- Ping the collector hostname or use curl/telnet to the collector port.
- Check DNS resolution
- Use dig or nslookup to resolve the collector hostname.
- Firewall / security groups
- Ensure the outbound port (usually 443, or the configured port) is open on the host firewall and in cloud security groups.
- Proxy configuration
- If your environment uses a proxy, confirm agent proxy settings (HTTP_PROXY / HTTPS_PROXY or agent config).
- TLS / certificate errors
- Confirm system time is correct; certificate validation fails when the host clock falls outside the certificate's validity period.
- If using custom CA, ensure the agent trusts the CA or has access to the CA bundle.
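The connectivity checks above can be run in sequence so that the first failure points at the layer to investigate. This is a rough sketch, assuming a Linux host with getent and curl installed and a collector that speaks HTTPS on the given port. The hostname in the usage comment is a placeholder, not a real Cloudiff endpoint.

```shell
# Sketch: run DNS and HTTPS checks in order and stop at the first failure.
check_collector() {
    host="$1"
    port="${2:-443}"

    # 1. DNS: getent uses the system resolver configuration (NSS).
    if ! getent hosts "$host" >/dev/null; then
        echo "DNS lookup failed for $host"
        return 1
    fi

    # 2. TCP and TLS: curl exits non-zero on refused connections,
    #    timeouts, and certificate errors, and prints the reason.
    if ! curl --silent --show-error --max-time 10 --output /dev/null "https://$host:$port/"; then
        echo "could not reach https://$host:$port/"
        return 2
    fi

    echo "collector reachable at $host:$port"
}

# Print the proxy environment variables that are set, or note that none are.
show_proxy_settings() {
    found=0
    for name in HTTP_PROXY HTTPS_PROXY NO_PROXY http_proxy https_proxy no_proxy; do
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            echo "$name=$value"
            found=1
        fi
    done
    [ "$found" -eq 1 ] || echo "no proxy variables set"
}

# Usage (placeholder hostname):
#   show_proxy_settings
#   check_collector collector.example.com 443
```

Run the script as the same user and with the same environment as the agent service, since proxy variables set in an interactive shell may not be visible to the service.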
3. Inspecting and interpreting logs
- Log locations
- Linux: /var/log/cloudiff-agent/.log (or configured log directory)
- Windows: %ProgramData%\Cloudiff\logs\ or Event Viewer under Application
- What to look for
- Authentication failures (invalid API key, token expiration)
- Network timeouts or “connection refused”
- Configuration parse errors
- Plugin/module load failures
- Tips
- Search for repeated ERROR or WARN entries with timestamps.
- Enable debug logging in agent config for more detail, reproduce the issue, then revert logging level.
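A quick way to apply these tips is to count ERROR and WARN lines and then pull out recent lines that match the failure categories listed above. The sketch below assumes plain-text logs with one entry per line. The search patterns are guesses at typical wording, not the agent's actual messages, so adjust them to what your logs contain.

```shell
# Sketch: summarize ERROR and WARN counts in one log file.
count_levels() {
    log="$1"
    # grep -c prints the number of matching lines; "|| true" keeps a zero
    # count from being treated as a failure under "set -e".
    errors=$(grep -c 'ERROR' "$log" || true)
    warns=$(grep -c 'WARN' "$log" || true)
    echo "ERROR=$errors WARN=$warns"
}

# Sketch: show the last N lines that mention common failure categories.
# The patterns are assumptions about message wording.
recent_failures() {
    log="$1"
    n="${2:-20}"
    grep -Ei 'auth|api key|token|timeout|connection refused|parse|plugin|module' "$log" | tail -n "$n"
}

# Usage: pass the path of one agent log file.
#   count_levels path/to/agent.log
#   recent_failures path/to/agent.log 50
```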
4. Authentication and API key issues
- Validate credentials
- Confirm the API key/token in the agent config matches the one in the control plane.
- Key rotation
- If keys were rotated, update the agent and restart.
- Permission scope
- Ensure the key has required permissions for metrics/log ingestion.
- Clock skew
- Some token systems are time-sensitive, so sync host time with NTP.
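One way to confirm that the key on the host matches the key in the control plane, without pasting the key into chat or tickets, is to compare short fingerprints of both. This is an illustrative sketch rather than a Cloudiff feature, and it assumes sha256sum from GNU coreutils is available. The variable names in the usage comment are hypothetical.

```shell
# Sketch: print a short SHA-256 fingerprint of a secret read from stdin,
# so two copies of a key can be compared without displaying the key itself.
key_fingerprint() {
    # tr removes stray newlines and carriage returns that often
    # sneak in when a key is copied from a web page or editor.
    tr -d '\r\n' | sha256sum | cut -c1-12
}

# Usage (hypothetical variables holding the two copies of the key):
#   printf '%s' "$KEY_FROM_AGENT_CONFIG" | key_fingerprint
#   printf '%s' "$KEY_FROM_CONTROL_PLANE" | key_fingerprint
# Matching output means the two keys are identical.
```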
5. High memory or CPU usage
- Identify resource-hungry components
- Check which plugins or collectors are active.
- Adjust collection interval
- Increase polling intervals for heavy collectors.
- Limit local buffering
- Large buffers can increase memory; tune buffer sizes in config.
- Upgrade agent or host
- Ensure you run a supported agent version and the host meets resource needs.
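Before tuning intervals or buffers, it helps to record how much memory the agent process actually uses so you can compare before and after. The following sketch is Linux-only and reads /proc directly. The process name in the usage comment is assumed to match the service name, which may not hold on your system.

```shell
# Sketch (Linux only): print current and peak resident memory of a process
# in MiB, read from /proc/<pid>/status (values there are in KiB).
proc_memory() {
    pid="$1"
    awk '/^VmRSS:/ { rss = $2 }
         /^VmHWM:/ { peak = $2 }
         END { printf "rss_mib=%.1f peak_mib=%.1f\n", rss / 1024, peak / 1024 }' "/proc/$pid/status"
}

# Usage (assumes the process is named cloudiff-agent):
#   proc_memory "$(pgrep -o cloudiff-agent)"
```

Sampling this value periodically while changing one setting at a time makes it easier to attribute a drop in memory to a specific collector or buffer size.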
6. Metric or log gaps / missing data
- Confirm collection config
- Verify enabled integrations, paths, and service discovery rules.
- Check agent and collector time alignment
- Time mismatch can make data appear missing.
- Rate limits / throttling
- Backend rate limiting can drop data; check backend error logs or dashboards for throttling indicators.
- Local file rotation
- For log collection, ensure log rotation doesn’t move files out from under the agent without notifying it.
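To see why rotation can cause gaps, note that rename-style rotation keeps the old file's inode under a new name and creates a fresh file at the original path. A reader that still holds the old file may never see the new one. The sketch below only demonstrates how to detect that a path now points at a different file. How Cloudiff Monitor Agent itself tracks rotated files is not covered by this guide.

```shell
# Print the inode number of a file. A changed inode at the same path means
# the file was replaced, for example by rename-style log rotation.
file_inode() {
    ls -i "$1" | awk '{ print $1 }'
}

# Return 0 if the path now points at a different file than the recorded inode.
was_rotated() {
    path="$1"
    old_inode="$2"
    [ "$(file_inode "$path")" != "$old_inode" ]
}

# Usage:
#   before=$(file_inode /path/to/app.log)
#   ... wait for a rotation cycle ...
#   was_rotated /path/to/app.log "$before" && echo "file was replaced"
```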
7. Integration/plugin specific errors
- Common actions
- Verify integration credentials and connection strings.
- Confirm correct plugin version for your agent.
- Test connecting to the target service directly from the host (e.g., database, API).
- Re-enable modules one at a time to isolate the failing integration.
8. Corrupted configuration or bad YAML/JSON
- Validate config syntax
- Use a YAML/JSON linter or cloudiff-agent --configtest if available.
- Roll back recent changes
- Revert to the last known-good config and restart the agent.
- Keep backups
- Maintain versioned copies of config files.
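The backup and validation steps can be scripted so that every edit starts from a saved copy. This is a minimal sketch: the JSON check uses Python's built-in json.tool module and applies only if your config is JSON and python3 is installed. The config path is supplied by the caller because this guide does not state where the agent keeps it.

```shell
# Sketch: copy a config file to a timestamped backup next to it
# and print the backup path.
backup_config() {
    cfg="$1"
    backup="$cfg.$(date +%Y%m%d-%H%M%S).bak"
    cp -p "$cfg" "$backup" && echo "$backup"
}

# Sketch: check JSON syntax; exits non-zero and prints the error on bad input.
validate_json() {
    python3 -m json.tool "$1" >/dev/null
}

# Usage:
#   backup_config /path/to/agent-config.json
#   validate_json /path/to/agent-config.json && echo "syntax ok"
```

To roll back, copy the most recent .bak file over the live config and restart the agent.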
9. Upgrading and version compatibility problems
- Check release notes
- Review agent changelog for breaking changes before upgrading.
- Staged rollout
- Upgrade a subset of hosts first and monitor behavior.
- Fallback plan
- Keep the previous installer/package handy to roll back quickly.
10. When to contact support
- Gather and provide:
- Agent version, OS, and architecture.
- Agent logs (last 200–1,000 lines).
- Exact error messages and timestamps.
- Relevant config snippets (redact secrets).
- Network connectivity test results (ping, curl, telnet).
- Use the control plane support channel or your internal ops workflow.
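Gathering this information by hand is error-prone, so a small script can help. The sketch below collects OS details and the tail of each log, and masks secret-looking values in config snippets. The key names it redacts (api_key, token, password, secret) are assumptions, matched in lowercase only, and it handles YAML- or INI-style lines rather than JSON. Review the output before sending it.

```shell
# Sketch: mask values of secret-looking keys in YAML- or INI-style text.
# The key names are guesses; extend the list to match your config.
redact_secrets() {
    sed -E 's/^([[:space:]]*[A-Za-z0-9_.-]*(api_key|token|password|secret)[A-Za-z0-9_.-]*[[:space:]]*[:=]).*/\1 REDACTED/'
}

# Sketch: write OS details and the last N lines of each *.log file
# in log_dir into out_dir.
collect_diagnostics() {
    log_dir="$1"
    out_dir="$2"
    lines="${3:-1000}"
    mkdir -p "$out_dir"
    uname -a > "$out_dir/system.txt"
    for f in "$log_dir"/*.log; do
        [ -f "$f" ] || continue
        tail -n "$lines" "$f" > "$out_dir/$(basename "$f").tail"
    done
}

# Usage:
#   collect_diagnostics /path/to/agent/logs ./diag 1000
#   redact_secrets < /path/to/agent-config > ./diag/config.redacted
```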
Quick checklist (summary)
- Confirm service is running; restart if needed.
- Check agent logs for ERROR/WARN lines.
- Verify network, DNS, and firewall settings.
- Validate API keys and permissions.
- Ensure host time is correct and system resources are sufficient.
- Test integrations and adjust collection intervals.
- Validate configuration syntax and roll back if necessary.
- Collect diagnostics before contacting support.
Following these steps will resolve most Cloudiff Monitor Agent issues. If the problem persists after the checklist, gather diagnostics and contact support.