Monitoring
Monitor your DataMgmt Node deployment for health, performance, and security.
Health Checks
Endpoints
Both APIs expose health check endpoints:
| Endpoint | Port | Description |
|---|---|---|
| GET /health | 8080 | Internal API health |
| GET /health | 8081 | External API health |
Internal API Health
{
  "status": "healthy",
  "components": {
    "blockchain": "connected",
    "p2p_network": "running",
    "encryption": "initialized"
  },
  "version": "0.1.0"
}
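A response like the one above can be checked component by component. The following is a minimal sketch, assuming that the three component states shown in the sample are the only "good" values; the function names are illustrative and not part of the DataMgmt API.

```python
import json
from urllib.request import urlopen

# Assumption: the states in the sample response are the expected "good"
# values. Any other value (or a missing component) is reported as a problem.
EXPECTED_COMPONENT_STATES = {
    "blockchain": "connected",
    "p2p_network": "running",
    "encryption": "initialized",
}


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/health and decode the JSON body."""
    with urlopen(f"{base_url}/health", timeout=timeout) as response:
        return json.load(response)


def failing_components(payload: dict) -> dict:
    """Return {component: observed_state} for each component that is
    missing or not in its expected state."""
    components = payload.get("components", {})
    return {
        name: components.get(name, "missing")
        for name, expected in EXPECTED_COMPONENT_STATES.items()
        if components.get(name) != expected
    }
```

For the internal API on its documented port, a call might look like `failing_components(fetch_health("http://localhost:8080"))`; an empty dict means every listed component reported its expected state.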
External API Health
Health Status Values
| Status | Description |
|---|---|
| healthy | All components operational |
| degraded | Some components have issues |
| unhealthy | Critical failure |
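Many monitoring tools expect a probe to report its result as a process exit code. The sketch below maps the three status values onto the common Nagios-style convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The mapping is an assumption made for illustration, not something DataMgmt defines.

```python
# Assumed mapping from the documented status strings to Nagios-style
# plugin exit codes. Anything unrecognized is reported as UNKNOWN.
EXIT_CODES = {"healthy": 0, "degraded": 1, "unhealthy": 2}
UNKNOWN = 3


def exit_code_for(payload: dict) -> int:
    """Translate a /health response body into a probe exit code."""
    status = payload.get("status")
    if not isinstance(status, str):
        return UNKNOWN
    return EXIT_CODES.get(status, UNKNOWN)
```

A wrapper script could finish with `sys.exit(exit_code_for(payload))` so that the scheduler running the probe sees the severity directly.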
Metrics
Network Statistics
{
  "total_peers": 15,
  "healthy_peers": 12,
  "data_sent": 10485760,
  "data_received": 5242880,
  "uptime": 86400
}
Key Metrics to Monitor
| Metric | Description | Alert Threshold |
|---|---|---|
| healthy_peers | Connected healthy peers | < 3 |
| total_peers | Total known peers | < 5 |
| uptime | Node uptime in seconds | - |
| data_sent | Bytes sent | - |
| data_received | Bytes received | - |
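The two alert thresholds in the table can be applied to a network statistics payload with a few lines of code. This is a minimal sketch; the function and type names are illustrative, and treating a missing field as an alert is an assumption.

```python
from typing import NamedTuple


class Threshold(NamedTuple):
    metric: str
    minimum: int


# Thresholds taken from the table above: alert when the value is below them.
THRESHOLDS = (
    Threshold("healthy_peers", 3),
    Threshold("total_peers", 5),
)


def threshold_alerts(stats: dict) -> list[str]:
    """Return one human-readable message per violated threshold."""
    alerts = []
    for metric, minimum in THRESHOLDS:
        value = stats.get(metric)
        if value is None:
            # Assumption: a missing metric is worth flagging too.
            alerts.append(f"{metric} missing from statistics")
        elif value < minimum:
            alerts.append(f"{metric}={value} is below {minimum}")
    return alerts
```

With the sample statistics shown earlier (15 known peers, 12 healthy) the function returns an empty list.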
Logging
Log Configuration
Logs are written to stdout with the format:
Log Levels
| Level | Description |
|---|---|
| DEBUG | Detailed debugging information |
| INFO | Normal operation events |
| WARNING | Unexpected but handled events |
| ERROR | Errors that need attention |
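The Fluentd parser pattern later in this guide implies a line layout of `time - module - level - message`. Assuming the node uses Python's standard `logging` module (not confirmed by this guide), a logger that writes that layout to stdout and filters by level could be sketched as follows. The logger name used in the usage note is hypothetical.

```python
import logging
import sys

# Assumed format, inferred from the Fluentd regexp in this guide:
# "YYYY-MM-DD HH:MM:SS - module - LEVEL - message"
LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_logger(name: str, level: int = logging.INFO, stream=None) -> logging.Logger:
    """Create a logger that writes formatted lines to stdout (or `stream`)."""
    target = stream if stream is not None else sys.stdout
    handler = logging.StreamHandler(target)
    handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))

    logger = logging.getLogger(name)
    logger.handlers.clear()   # avoid duplicate lines if called twice
    logger.addHandler(handler)
    logger.setLevel(level)    # messages below this level are dropped
    logger.propagate = False  # do not also emit via the root logger
    return logger
```

For example, `build_logger("datamgmt.p2p", logging.DEBUG)` would also emit DEBUG messages, while the default INFO level suppresses them.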
Important Log Messages
Startup:
INFO - Configuration validated successfully
INFO - Node started successfully
INFO - P2P Network started on port 8000
Peer Events:
INFO - Connected to peer 192.168.1.10:8000
WARNING - Peer 10.0.0.5:8000 marked unhealthy
INFO - Discovered 5 new peers
Data Operations:
Errors:
ERROR - Failed to connect to blockchain: Connection refused
ERROR - Authorization verification failed: Invalid signature
WARNING - Rate limit exceeded for 192.168.1.100
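Messages like these can be scanned for the events that usually matter during an incident. The sketch below counts lines per level and collects the peers marked unhealthy and the clients that hit the rate limit. The patterns are copied from the sample messages above, so real messages may need different expressions.

```python
import re
from collections import Counter

# Patterns derived from the sample messages in this section (assumption:
# the node's real wording matches these samples).
LEVEL = re.compile(r"\b(DEBUG|INFO|WARNING|ERROR) - ")
UNHEALTHY_PEER = re.compile(r"Peer (\S+) marked unhealthy")
RATE_LIMITED = re.compile(r"Rate limit exceeded for (\S+)")


def summarize(lines) -> dict:
    """Tally log levels and extract peer and client addresses of interest."""
    levels = Counter()
    unhealthy_peers = set()
    rate_limited = set()
    for line in lines:
        level = LEVEL.search(line)
        if level:
            levels[level.group(1)] += 1
        peer = UNHEALTHY_PEER.search(line)
        if peer:
            unhealthy_peers.add(peer.group(1))
        client = RATE_LIMITED.search(line)
        if client:
            rate_limited.add(client.group(1))
    return {
        "levels": dict(levels),
        "unhealthy_peers": unhealthy_peers,
        "rate_limited": rate_limited,
    }
```

Because the level pattern uses `search`, it works both on the short form shown here and on full lines that start with a timestamp and module name.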
Prometheus Integration
Expose Metrics
Add a metrics endpoint (custom implementation required):
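# Example metrics endpoint
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Gauge, generate_latest

requests_total = Counter('datamgmt_requests_total', 'Total requests', ['endpoint'])
peers_connected = Gauge('datamgmt_peers_connected', 'Connected peers')
data_operations = Counter('datamgmt_data_operations', 'Data operations', ['type'])  # exported with a _total suffix

def metrics_response():
    # Body and Content-Type to return from a /metrics route in your web framework
    return generate_latest(), CONTENT_TYPE_LATEST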
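If adding the `prometheus_client` dependency is not an option, the Prometheus text exposition format is simple enough to produce by hand. The following standard-library sketch is an alternative to the example above, not the project's implementation. It renders the network statistics as gauges and serves them on `/metrics`. Only `datamgmt_peers_connected` comes from this guide; the other metric names and the choice of `healthy_peers` as its value are assumptions.

```python
from http.server import BaseHTTPRequestHandler


def render_metrics(stats: dict) -> str:
    """Render network statistics in the Prometheus text exposition format."""
    samples = [
        # (name, type, help text, value); the last two names are hypothetical.
        ("datamgmt_peers_connected", "gauge", "Connected peers", stats["healthy_peers"]),
        ("datamgmt_peers_known", "gauge", "Total known peers", stats["total_peers"]),
        ("datamgmt_uptime_seconds", "gauge", "Node uptime in seconds", stats["uptime"]),
    ]
    lines = []
    for name, kind, help_text, value in samples:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {kind}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def make_handler(get_stats):
    """Build a request handler that serves render_metrics(get_stats())."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = render_metrics(get_stats()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return MetricsHandler
```

To serve it, pass a callable that returns the current statistics, for instance `HTTPServer(("", 9090), make_handler(get_stats)).serve_forever()` with `HTTPServer` imported from `http.server`. Here `get_stats` is a placeholder for however your node exposes its statistics.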
Prometheus Configuration
# prometheus.yml
scrape_configs:
  - job_name: 'datamgmt-node'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
Grafana Dashboard
Import a dashboard with these panels:
- Node Health - Health status over time
- Peer Connections - Connected vs healthy peers
- Data Operations - Shares, retrievals, verifications
- API Latency - Request response times
- Error Rate - Errors by type
Alerting
Alert Rules
# alerts.yml
groups:
  - name: datamgmt
    rules:
      - alert: NodeUnhealthy
        expr: datamgmt_health_status != 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "DataMgmt node is unhealthy"
      - alert: LowPeerCount
        expr: datamgmt_peers_connected < 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count: {{ $value }}"
      - alert: HighErrorRate
        expr: rate(datamgmt_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
      - alert: BlockchainDisconnected
        expr: datamgmt_blockchain_connected == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Blockchain connection lost"
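These rules reference `datamgmt_health_status` and `datamgmt_blockchain_connected`, which the metrics example earlier in this guide does not define. One way to derive them is from the `/health` response, as in the sketch below. The encoding (1 for healthy or connected, 0 otherwise) is an assumption chosen to match the rule expressions, and the function name is illustrative.

```python
def health_gauges(payload: dict) -> dict:
    """Map a /health response onto the gauge values the alert rules expect.

    Assumed encoding: 1 means healthy/connected, 0 means anything else,
    so a "degraded" node also triggers NodeUnhealthy.
    """
    components = payload.get("components", {})
    return {
        "datamgmt_health_status": 1 if payload.get("status") == "healthy" else 0,
        "datamgmt_blockchain_connected": (
            1 if components.get("blockchain") == "connected" else 0
        ),
    }
```

The returned values can be written into whichever gauges your metrics endpoint exposes each time the health state is refreshed. If `degraded` should not page anyone, map it to 1 here or introduce a separate value and adjust the `NodeUnhealthy` expression accordingly.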
Notification Channels
Route alerts to your notification channels, for example:
- Slack
- PagerDuty
- OpsGenie
Log Aggregation
Fluentd Configuration
<source>
  @type tail
  path /var/log/datamgmt/*.log
  pos_file /var/log/fluentd/datamgmt.pos
  tag datamgmt
  <parse>
    @type regexp
    expression /^(?<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (?<module>\S+) - (?<level>\S+) - (?<message>.*)$/
  </parse>
</source>
<match datamgmt.**>
  @type elasticsearch
  host elasticsearch
  port 9200
  index_name datamgmt
</match>
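Before deploying the parser, it can help to test the expression against a few real log lines. The sketch below is a Python translation of the same regexp; the only change is that Python spells named groups `(?P<name>...)` where the Fluentd (Ruby) pattern uses `(?<name>...)`. The sample line in the usage note is made up for illustration.

```python
import re
from typing import Optional

# Same pattern as the Fluentd <parse> block, with Python named-group syntax.
LOG_LINE = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - "
    r"(?P<module>\S+) - (?P<level>\S+) - (?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Split one log line into time, module, level, and message.

    Returns None when the line does not match the expected layout.
    """
    match = LOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None
```

For a hypothetical line such as `2024-01-15 10:30:00 - datamgmt.p2p - WARNING - Peer 10.0.0.5:8000 marked unhealthy`, the function returns the four fields as a dict. Lines that return `None` would also fail to parse in Fluentd.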
ELK Stack Queries
Search for errors:
Search for specific operations:
Uptime Monitoring
External Monitoring
Configure external monitoring services:
# Health check endpoint
https://api.datamgmt.example.com/health
# Expected response
{"status": "healthy", ...}
Uptime Checklist
- [ ] Health endpoint accessible externally
- [ ] Response time < 1 second
- [ ] Status returns "healthy"
- [ ] Check interval: 1 minute
- [ ] Alert after 3 failures
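The checklist translates into two small pieces of logic: deciding whether a single check passed, and deciding when enough checks have failed to alert. The sketch below assumes that "3 failures" means three consecutive failures and that a passing check returns HTTP 200; both are assumptions, and the names are illustrative.

```python
import json

MAX_RESPONSE_SECONDS = 1.0   # checklist: response time < 1 second
FAILURES_BEFORE_ALERT = 3    # checklist: alert after 3 failures


def check_passed(http_status: int, body: str, elapsed_seconds: float) -> bool:
    """Return True when one probe result satisfies the checklist."""
    if http_status != 200 or elapsed_seconds >= MAX_RESPONSE_SECONDS:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


class FailureTracker:
    """Counts consecutive failed checks and signals when to alert."""

    def __init__(self, threshold: int = FAILURES_BEFORE_ALERT):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, passed: bool) -> bool:
        """Record one check; return True exactly when the alert should fire."""
        if passed:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire once, on the check that reaches the threshold.
        return self.consecutive_failures == self.threshold
```

Run once per check interval (1 minute in the checklist), `record` returns `True` on the third consecutive failure and stays quiet afterwards until a successful check resets the counter.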
Troubleshooting
Common Issues
Node won't start:
No peers connecting:
High memory usage:
# Check process memory
ps aux | grep datamgmt
# Monitor over time
watch -n 5 'ps -o rss,vsz,pid,cmd -p $(pgrep -f datamgmt)'
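To log memory figures over time instead of watching them interactively, the `ps` output can be parsed into structured records. This sketch assumes Linux procps `ps`, which reports `rss` and `vsz` in KiB; the sample command line in the usage note is hypothetical.

```python
def parse_ps(output: str) -> list[dict]:
    """Parse the output of `ps -o rss,vsz,pid,cmd` into one dict per process.

    Assumes the first line is the header and that rss/vsz are in KiB.
    """
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)  # cmd may itself contain spaces
        if len(fields) < 4:
            continue  # ignore truncated or unexpected lines
        rss, vsz, pid, cmd = fields
        rows.append(
            {
                "pid": int(pid),
                "rss_mib": int(rss) / 1024,
                "vsz_mib": int(vsz) / 1024,
                "cmd": cmd,
            }
        )
    return rows
```

Feeding it the text captured from the `ps` command above (for example via `subprocess.run(..., capture_output=True, text=True).stdout`) yields records that can be appended to a file or pushed to a metrics gauge. A steadily rising `rss_mib` across samples is the usual sign of a leak.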
Debug Mode
Enable debug logging:
Next Steps
- Security Guide - Security monitoring
- Deployment Guide - Production setup