Monitoring in Linux¶

There are many ways to monitor Linux systems.
Manually inspecting resource usage and logs works for one or two systems, but it is tedious and does not scale.
This is why monitoring tools exist.

Most 3rd party tools will be installed in /opt.
E.g., Install Loki in /opt/loki

Table of Contents¶

Monitoring Tools for Linux Systems
Grafana + Loki + Promtail Monitoring Stack
Installing and Configuring Grafana
- Resetting the admin Password for Grafana
Installing and Configuring Loki
- Linking Loki to Grafana
Setting up Promtail
Grafana Dashboard Templates
Setting up InfluxDB2
Setting up Telegraf
Setting up node_exporter
- node_exporter sysconfig file
- node_exporter textfile_collector
Setting up Prometheus
Grafana Provisioning: Automatic Data Source and Dashboard Provisioning
- Grafana Data Source Provisioning
- Dashboard Provisioning
  - Changing Provisioned Dashboards
Resources

Monitoring Tools for Linux Systems¶

Many monitoring tools exist, for both logs and system metrics.

Aggregating data¶

node_exporter: Collects hardware and OS metrics (CPU, memory, disk usage, and network stats).
loki: Lightweight log aggregation tool, works well with Grafana to centralize and query logs.
telegraf: An agent for collecting, processing, and aggregating metrics and events.
- Supports many plugins for collecting metrics from services like MySQL, Docker, and Nginx.
fluentd: Log collector and processor.
- Used to aggregate logs from different sources and send them to a central location (like Loki or Elasticsearch).
collectd: Daemon for collecting system performance metrics.
- Can output to various destinations, including RRD (Round Robin Database), InfluxDB, or Graphite.

Storing data (Time Series Database)¶

prometheus: Popular TSDB focused on storing metrics scraped from exporters.
- Has a powerful query language (PromQL) for alerting and analysis.
influxdb: TSDB optimized for metrics and events.
- InfluxDB is often paired with telegraf for collecting and storing data.
elasticsearch: A search and analytics engine often used for storing and querying logs.
- Works well with tools like Kibana for log visualization.
opentsdb: A scalable TSDB designed for high-throughput metrics storage.

Data visualization¶

grafana: Highly customizable visualization tool for displaying metrics and logs.
- Integrates with Prometheus, Loki, InfluxDB, Elasticsearch, and more tools.
kibana: Visualization tool that works with Elasticsearch.
- Best for querying and analyzing log data.
victoriametrics Dashboards: Visualization and analysis tool for time-series data stored in VictoriaMetrics.
- Similar functionality to Prometheus+Grafana.

Real-time Monitoring and Alerts¶

Real-time system and application monitoring:
nagios: Classic tool for monitoring systems and infrastructure.
* Provides real-time alerts based on thresholds.
zabbix: Robust monitoring system for networks, servers, and applications.
* Includes built-in alerting and visualization tools.
monit: Lightweight monitoring and process management tool.
* Can automatically restart failing services.
glances: Terminal-based real-time system monitoring tool.
* Displays metrics like CPU, memory, and disk usage with a user-friendly interface.
htop: Interactive process viewer.
* Similar to top but more visual and user-friendly.

Cloud-Native Monitoring Tools¶

Specifically designed for containerized and distributed environments:
promtail: A log collection agent that works with Loki in cloud-native environments.
cAdvisor: Monitors container resource usage and performance.
* Often used with Docker and Kubernetes.
kubectl top: Part of the Kubernetes CLI.
* Provides real-time resource usage metrics for pods and nodes.
thanos: Highly available, scalable, and long-term storage solution for Prometheus metrics.

Tools Specialized for Logging¶

Specifically for log aggregation, processing, and storage:
rsyslog: A log collector and forwarder built into many Linux distros.
* Can send logs to remote servers or central aggregation systems.
logstash: Data processing pipeline for ingesting and enriching logs.
* Works with Elasticsearch and Kibana (ELK stack).
journald: System service for managing and querying logs on modern Linux systems.
* Part of systemd; its logs can be accessed with journalctl.

Grafana + Loki + Promtail Monitoring Stack¶

Installing and Configuring Grafana¶

Grafana docs for installation ¶

Install the required packages and the Grafana GPG key

sudo apt-get install -y \
apt-transport-https \
software-properties-common \
wget  

sudo wget -q -O /usr/share/keyrings/grafana.key https://apt.grafana.com/gpg.key

Add the Grafana apt repository.

echo "deb [signed-by=/usr/share/keyrings/grafana.key] https://apt.grafana.com stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list

Now, install Grafana.

sudo apt update && sudo apt install grafana-enterprise -y

Make sure Grafana is started.

sudo systemctl daemon-reload  
sudo systemctl start grafana-server  
sudo systemctl status grafana-server --no-pager

Verify that the Grafana server is serving on port 3000 (which is the default).

systemctl status grafana-server --no-pager  
ss -ntulp 
ss -ntulp | grep -i 'grafana'  
ss -ntulp | grep 3000

Default port for Grafana is 3000.

Check that the external Web UI is available.
Open a browser and go to the machine's IP on port 3000.
```
hostname -I | awk '{ print $1 }' # Get the IP if you don't know it  
```
Then go to http://<your-ip>:3000 (Grafana serves plain HTTP by default).
Once you're at the Web UI, log in with the defaults (admin for both username and
password), and change the password when prompted.

Resetting the admin Password for Grafana¶

In case you screw up and lock yourself out of the Grafana server:

sudo grafana-cli admin reset-admin-password admin

This resets the admin user account's password to admin (the default).

Installing and Configuring Loki¶

Create a directory to install Loki
```
mkdir /opt/loki && cd /opt/loki  
```

Download and unpack a current version of Loki (see the GitHub releases).

curl -L -O "https://github.com/grafana/loki/releases/download/v2.9.7/loki-linux-amd64.zip"  
unzip loki-linux-amd64.zip  
chmod a+x loki-linux-amd64

Make a Loki config file in the directory /opt/loki/loki-local-config.yaml.
- See local loki config file.
- See the loki.service file that should be at /etc/systemd/system/loki.service
- Once the service file is there, do a daemon-reload and enable Loki.
```
sudo systemctl daemon-reload  
sudo systemctl enable loki --now  
# Verify  
sudo systemctl status loki 
ps -ef | grep -i 'loki'  
```

Default port for Loki is 3100.
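
The loki.service file referenced above is not reproduced in these notes. As a rough sketch only, a minimal unit might look like the one generated below. The binary and config paths come from the steps above; the `loki` service account, the restart policy, and the `./staging` directory are assumptions made for illustration, so review the file before copying it into /etc/systemd/system/.

```shell
#!/bin/sh
# Sketch: write a minimal loki.service into a staging directory for review.
# Assumptions (not from the original notes): the ./staging path, the "loki"
# service account, and the restart policy.
set -eu

STAGE_DIR="${STAGE_DIR:-./staging}"
mkdir -p "$STAGE_DIR"

cat > "$STAGE_DIR/loki.service" <<'EOF'
[Unit]
Description=Loki log aggregation service
After=network-online.target
Wants=network-online.target

[Service]
# Hypothetical unprivileged account; create it first or remove this line.
User=loki
ExecStart=/opt/loki/loki-linux-amd64 -config.file=/opt/loki/loki-local-config.yaml
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

echo "Wrote $STAGE_DIR/loki.service"
```

Once the file looks right, `sudo cp staging/loki.service /etc/systemd/system/` followed by the daemon-reload step puts it in place.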

Linking Loki to Grafana¶

Go to Grafana and create the data source for Loki on the "Data sources" page (Connections > Data sources).
Select Loki from the list of sources, and enter the URL.
The URL to put in to link Loki: http://127.0.0.1:3100
The numeric IP is usually preferred over localhost, so the connection does not depend on name resolution working correctly.

Create a dashboard (import -> enter ID 13639 for a Loki preset dashboard) that shows the log files for your server.

Setting up Promtail¶

Configure Promtail to push logs from /var/log/auth.log and /var/log/syslog off the server to the Loki aggregator.

Create the directory to install Promtail in.

mkdir /opt/promtail && cd /opt/promtail

Download and extract the Promtail executable (check the releases)

curl -L -O "https://github.com/grafana/loki/releases/download/v2.7.1/promtail-linux-amd64.zip"  
unzip promtail-linux-amd64.zip

You'll also need a config file here /opt/promtail/promtail-local-config.yaml.
- See example config.
- See the promtail.service file that should be
  copied to /etc/systemd/system/promtail.service.

Once the service file is there, do a daemon-reload and start promtail.

sudo systemctl daemon-reload  
sudo systemctl enable promtail.service --now  
# Verify it's running  
systemctl status promtail.service --no-pager  
ps -ef | grep -i promtail
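
The example Promtail config is only linked, not shown. The sketch below generates one possible promtail-local-config.yaml that tails /var/log/auth.log and /var/log/syslog and pushes to a Loki instance on the same host. The HTTP listen port 9080, the positions file location, the job labels, and the `./staging` directory are assumptions; adjust the client URL if Loki runs on another machine.

```shell
#!/bin/sh
# Sketch: generate a Promtail config that tails auth.log and syslog.
# Assumptions: listen port 9080, the positions path, the label values,
# and the ./staging output directory.
set -eu

STAGE_DIR="${STAGE_DIR:-./staging}"
mkdir -p "$STAGE_DIR"

cat > "$STAGE_DIR/promtail-local-config.yaml" <<'EOF'
server:
  http_listen_port: 9080
  grpc_listen_port: 0

# Promtail records how far it has read each file here.
positions:
  filename: /opt/promtail/positions.yaml

# Loki push endpoint (Loki's default port is 3100).
clients:
  - url: http://127.0.0.1:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: authlog
          __path__: /var/log/auth.log
      - targets: [localhost]
        labels:
          job: syslog
          __path__: /var/log/syslog
EOF

echo "Wrote $STAGE_DIR/promtail-local-config.yaml"
```

The `job` labels are what you would later filter on in Grafana, for example with a query such as `{job="authlog"}`.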

Grafana Dashboard Templates¶

Grafana has dashboard templates that allow you to quickly set up a dashboard for a
service.

Some template numbers:
159: Prometheus template
13639: Loki (log) template

Setting up InfluxDB2¶

# Get the GPG key  
wget -q https://repos.influxdata.com/influxdata-archive_compat.key  

# Dearmor the key and save it to /etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg  
echo '393e8779c89ac8d958f81f942f9ad7fb82a25e133faddaf92e15b16e6ac9ce4c influxdata-archive_compat.key' | sha256sum -c && cat influxdata-archive_compat.key | gpg --dearmor | tee /etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg > /dev/null  

# Add the repo and use the key to sign for the repo  
echo 'deb [signed-by=/etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg] https://repos.influxdata.com/debian stable main' | tee /etc/apt/sources.list.d/influxdata.list  

# Install InfluxDB  
sudo apt update && sudo apt-get install -y influxdb2

sudo systemctl start influxdb  
sudo systemctl enable influxdb

InfluxDB listens on port 8086

ss -ntulp | grep 8086  
lsof -i :8086

Go to the web UI at http://127.0.0.1:8086, set up an account, organization, and bucket, and copy the token that is given.
The token looks like a base64-encoded string:

bBInHhXJ8z6VOz4kbyr_mrvl25AWk__8HxzTkyGl33AMZlYXVp8kHui0SDhbLUC9w5aVJY_O3GY3pp6qaPSmXA==

Default port for InfluxDB2 is 8086.

Connect InfluxDB to Grafana¶

Go to the Grafana web UI and go to "Add a Datasource."

Select InfluxDB and fill in the information:

Select Flux as the query language.
Add username/Password
Add organization name
Add bucket name
Add the token from the InfluxDB web UI

Setting up Telegraf¶

Telegraf is a versatile agent that both collects data and sends it elsewhere.
It is often paired with InfluxDB.
It has four main functions:
* Input
* Output
* Process
* Aggregation

Telegraf can send its output to many different platforms, including InfluxDB, Azure Monitor, Google Cloud PubSub, CloudWatch, Elasticsearch, and Graphite.

# Get the key  
wget -q https://repos.influxdata.com/influxdata-archive_compat.key  

# Dearmor the key and save it in /etc/apt/trusted.gpg.d/influxdata-  
echo '393e8779c89ac8d958f81f942f9ad7fb82a25e133faddaf92e15b16e6ac9ce4c influxdata-archive_compat.key' | sha256sum -c && cat influxdata-archive_compat.key | gpg --dearmor | tee /etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg > /dev/null  

# Set up the repository  
echo 'deb [signed-by=/etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg] https://repos.influxdata.com/debian stable main' | tee /etc/apt/sources.list.d/influxdata.list  

# Install Telegraf  
sudo apt update && sudo apt-get install -y telegraf

You need to configure the InfluxDB v2 output plugin in the Telegraf configuration
file:

vi /etc/telegraf/telegraf.conf

Then add your token, bucket, and organization:

# # Configuration for sending metrics to InfluxDB 2.0  
 [[outputs.influxdb_v2]]  
#   ## The URLs of the InfluxDB cluster nodes.  
#   ##  
#   ## Multiple URLs can be specified for a single cluster, only ONE of the  
#   ## urls will be written to each interval.  
#   ##   ex: urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]  
   urls = ["http://127.0.0.1:8086"]  
#  
#   ## Token for authentication.  
   token = "bBInHhXJ8z6VOz4kbyr_mrvl25AWk__8HxzTkyGl33AMZlYXVp8kHui0SDhbLUC9w5aVJY_O3GY3pp6qaPSmXA=="  
#  
#   ## Organization is the name of the organization you wish to write to.  
   organization = "killerkodalab"  
#  
#   ## Destination bucket to write into.  
   bucket = "killerkodalab"

Make sure it's writing out to InfluxDB2:

systemctl restart telegraf  
systemctl status telegraf --no-pager -l
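
Telegraf expands environment variables written as `${NAME}` in its config, which makes it possible to keep the token out of telegraf.conf. The sketch below stages an environment file and a matching output section. It assumes the packaged systemd unit reads /etc/default/telegraf (worth confirming with `systemctl cat telegraf`); the variable name `INFLUX_TOKEN` and the `./staging` directory are made up for this example.

```shell
#!/bin/sh
# Sketch: keep the InfluxDB token in an environment file instead of
# telegraf.conf. INFLUX_TOKEN and ./staging are hypothetical names.
set -eu

STAGE_DIR="${STAGE_DIR:-./staging}"
mkdir -p "$STAGE_DIR"

# Environment file; assumed destination is /etc/default/telegraf.
cat > "$STAGE_DIR/telegraf.env" <<'EOF'
INFLUX_TOKEN=replace-with-your-token
EOF
chmod 600 "$STAGE_DIR/telegraf.env"

# Output section that references the variable. The quoted heredoc keeps
# ${INFLUX_TOKEN} unexpanded so Telegraf, not the shell, substitutes it.
cat > "$STAGE_DIR/outputs-influxdb_v2.conf" <<'EOF'
[[outputs.influxdb_v2]]
  urls = ["http://127.0.0.1:8086"]
  token = "${INFLUX_TOKEN}"
  organization = "killerkodalab"
  bucket = "killerkodalab"
EOF

echo "Wrote $STAGE_DIR/telegraf.env and $STAGE_DIR/outputs-influxdb_v2.conf"
```

Running `telegraf --config /etc/telegraf/telegraf.conf --test` afterwards prints one round of collected metrics without writing them to any output, so it checks that the config parses but not that InfluxDB accepts the token.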

Setting up node_exporter¶

To install node_exporter, download the tarball from the GitHub releases and
extract it.

Make a directory for the installation, /opt/node_exporter

mkdir /opt/node_exporter  

# Download into /opt/node_exporter  
curl -sSL -o /opt/node_exporter/node_exporter-1.8.2.linux-amd64.tar.gz  \
https://github.com/prometheus/node_exporter/releases/download/v1.8.2/node_exporter-1.8.2.linux-amd64.tar.gz  

# Extract into /opt/node_exporter  
tar -xzvf /opt/node_exporter/node_exporter-*.*-amd64.tar.gz -C /opt/node_exporter
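
With tar's bundled short options, `-f` has to come last because the next word is taken as the archive name. A small helper along these lines makes the extraction step repeatable. `extract_release` is a hypothetical name, and the demo uses a throwaway archive built on the spot rather than a real download.

```shell
#!/bin/sh
# Sketch: unpack a release tarball into a target directory.
# extract_release is a hypothetical helper, not part of any tool.
set -eu

extract_release() {
    tarball="$1"
    dest="$2"
    mkdir -p "$dest"
    # -x extract, -z gunzip; -f comes last among the short options
    # because the word that follows it is taken as the archive name.
    tar -xzf "$tarball" -C "$dest"
}

# Self-contained demo using a scratch archive instead of a real download.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/src/node_exporter-demo"
echo "placeholder" > "$demo_dir/src/node_exporter-demo/node_exporter"
tar -czf "$demo_dir/demo.tar.gz" -C "$demo_dir/src" node_exporter-demo

extract_release "$demo_dir/demo.tar.gz" "$demo_dir/out"
ls "$demo_dir/out/node_exporter-demo"
```

For the real tarball the call would be `extract_release /opt/node_exporter/node_exporter-1.8.2.linux-amd64.tar.gz /opt/node_exporter`.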

Get the default configuration files using git:

git clone https://github.com/prometheus/node_exporter.git /opt/node_exporter/node_exporter-1.8.2.linux-amd64/config_files  
# NOTE: You can't use wildcards in the git clone destination. Use the exact version directory.

Copy some of the example files from this repo.

Copy the unit files node_exporter.service and node_exporter.socket to the systemd directory.

cp /opt/node_exporter/node_exporter-*.*-amd64/config_files/examples/systemd/node_exporter.service /etc/systemd/system/  
cp /opt/node_exporter/node_exporter-*.*-amd64/config_files/examples/systemd/node_exporter.socket /etc/systemd/system/node_exporter.socket

The files in /etc/systemd/system named *.socket/*.service are called unit files.

Copy the node_exporter binary to /usr/sbin/.
```
cp /opt/node_exporter/node_exporter-*.*-amd64/node_exporter /usr/sbin/  
```

OPTIONAL: Copy the sysconfig.node_exporter file into /etc/sysconfig/ as node_exporter.

mkdir -p /etc/sysconfig  # May not exist on Debian-family systems  
cp /opt/node_exporter/node_exporter-*.*-amd64/config_files/examples/systemd/sysconfig.node_exporter /etc/sysconfig/node_exporter

See more about the sysconfig file

OPTIONAL: Create a directory named /var/lib/node_exporter/textfile_collector:
```
mkdir -p /var/lib/node_exporter/textfile_collector  
```
- See more about the textfile_collector

Create a user account for node_exporter and give it ownership of the sysconfig and the textfile_collector.

useradd -s /sbin/nologin node_exporter  
chown -R node_exporter:node_exporter /var/lib/node_exporter/textfile_collector /etc/sysconfig/node_exporter

Do a daemon-reload and enable node_exporter:

systemctl daemon-reload  
systemctl enable node_exporter.service --now

See if node_exporter is running and exposing the right port (9100):

systemctl status node_exporter --no-pager  
sleep 2
curl http://localhost:9100/metrics

Default port for node_exporter is 9100.

node_exporter `sysconfig` file¶

If you don't need customized monitoring and the default is fine, you don't need
this file; it is completely optional.

The sysconfig file is used to configure node_exporter.
It allows you to enable/disable certain collectors, so you can control what metrics
are collected.

Instead of passing flags directly when starting node_exporter, you can store
them in the sysconfig file, which the service manager (systemd) reads.

For example, to disable all collectors except the CPU, memory, and filesystem collectors, turn off the defaults and re-enable only those three:

OPTIONS="--collector.disable-defaults --collector.cpu --collector.meminfo --collector.filesystem"

This file would be:
* /etc/sysconfig/node_exporter on RedHat family systems (RHEL/CentOS/Fedora)
* /etc/default/node_exporter on Debian family systems (Debian/Ubuntu/Mint).

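This path can be changed in the service file if you want.
In /etc/systemd/system/node_exporter.service, use one of the following (systemd unit files do not support trailing comments, so the labels sit on their own lines):

# RHEL family  
EnvironmentFile=/etc/sysconfig/node_exporter  
# Debian family  
EnvironmentFile=/etc/default/node_exporter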
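
Rather than editing the unit file in place, the path can also be changed with a systemd drop-in, which survives the unit file being copied over again. The sketch below stages such an override; the `./staging` directory and the override.conf file name are arbitrary choices for this example.

```shell
#!/bin/sh
# Sketch: stage a drop-in that points node_exporter at the Debian-style
# environment file. ./staging and override.conf are arbitrary names.
set -eu

STAGE_DIR="${STAGE_DIR:-./staging}"
mkdir -p "$STAGE_DIR/node_exporter.service.d"

cat > "$STAGE_DIR/node_exporter.service.d/override.conf" <<'EOF'
[Service]
# An empty assignment clears the EnvironmentFile list inherited from the
# main unit; the next line then sets the new location.
EnvironmentFile=
EnvironmentFile=/etc/default/node_exporter
EOF

echo "Wrote $STAGE_DIR/node_exporter.service.d/override.conf"
```

Copying the node_exporter.service.d directory into /etc/systemd/system/ and running `systemctl daemon-reload` applies it; `systemctl edit node_exporter` creates the same kind of drop-in interactively.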

node_exporter textfile_collector¶

If you don't need customized monitoring and the default is fine, you don't need
this directory.

The textfile_collector is used to gather metrics that aren't natively supported
by node_exporter.

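For example, if you generate metrics with cron jobs, scripts, or other programs, you can
expose them using the textfile_collector.

You'd place metric files in /var/lib/node_exporter/textfile_collector.
The collector only reads files ending in .prom, and node_exporter must be started with --collector.textfile.directory pointing at that directory.
A metric file should look like this: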

# HELP my_custom_metric A custom metric for testing  
# TYPE my_custom_metric counter  
my_custom_metric 123

Then node_exporter will expose these metrics along with its standard metrics.
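
A cron job or script that produces such a file should avoid leaving a half-written file for node_exporter to scrape. One common approach, sketched here, is to write to a temporary name and rename it into place. `write_metric` is a hypothetical helper, the metric is typed as a gauge because the value is set rather than accumulated, and the demo writes to a scratch directory instead of the real collector directory.

```shell
#!/bin/sh
# Sketch: publish one custom metric for the textfile collector.
# write_metric is a hypothetical helper; the metric name is a placeholder.
set -eu

write_metric() {
    dir="$1"; name="$2"; value="$3"; help_text="$4"
    tmp="$dir/$name.prom.$$"
    # Write to a temporary file first, then rename. The rename is atomic on
    # the same filesystem, so a scrape never sees a partial file. The
    # temporary name does not end in .prom, so the collector ignores it.
    {
        printf '# HELP %s %s\n' "$name" "$help_text"
        printf '# TYPE %s gauge\n' "$name"
        printf '%s %s\n' "$name" "$value"
    } > "$tmp"
    mv "$tmp" "$dir/$name.prom"
}

# Demo against a scratch directory; a real job would target
# /var/lib/node_exporter/textfile_collector instead.
out_dir="$(mktemp -d)"
write_metric "$out_dir" my_custom_metric 123 "A custom metric for testing"
cat "$out_dir/my_custom_metric.prom"
```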

Setting up Prometheus¶

Just like node_exporter, you also need to download and extract the tarball for Prometheus.

Create a directory for prometheus
```
mkdir /var/lib/prometheus  
```

Download and extract Prometheus.

# -L follows the redirect that GitHub issues for release downloads  
curl -sSL -o /tmp/prometheus-2.42.0-rc.0.linux-amd64.tar.gz \
    https://github.com/prometheus/prometheus/releases/download/v2.42.0-rc.0/prometheus-2.42.0-rc.0.linux-amd64.tar.gz  
tar -xzvf /tmp/prometheus-2.42.0-rc.0.linux-amd64.tar.gz -C /var/lib/prometheus/ --strip-components=1

Copy the binary into /usr/bin

cp /var/lib/prometheus/prometheus /usr/bin/prometheus

Add a user account for Prometheus and give it the /var/lib/prometheus directory.

useradd prometheus  
chown -R prometheus:prometheus /var/lib/prometheus

Create a directory in /etc for prometheus configuration.

mkdir /etc/prometheus  
cp prometheus.yml /etc/prometheus/prometheus.yml

See example prometheus.yml file
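
The example prometheus.yml is only linked. As a starting point, the sketch below generates a small config that scrapes Prometheus itself and a node_exporter on the same host, using the default ports mentioned in these notes. The 15-second scrape interval, the job names, and the `./staging` directory are assumptions.

```shell
#!/bin/sh
# Sketch: generate a minimal prometheus.yml for a single host.
# Assumptions: 15s scrape interval, job names, ./staging directory.
set -eu

STAGE_DIR="${STAGE_DIR:-./staging}"
mkdir -p "$STAGE_DIR"

cat > "$STAGE_DIR/prometheus.yml" <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  # Prometheus scraping its own metrics endpoint.
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  # node_exporter on the same host (default port 9100).
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']
EOF

echo "Wrote $STAGE_DIR/prometheus.yml"
```

The promtool binary that ships in the same tarball can validate the result with `promtool check config staging/prometheus.yml` before it is copied to /etc/prometheus/.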

Copy the prometheus.service file into /etc/systemd/system/
```
cp prometheus.service /etc/systemd/system/prometheus.service  
```
- see example prometheus service file
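
For reference, a minimal prometheus.service might look like the sketch below. It reuses the binary, config, and user from the steps above; the data directory /var/lib/prometheus/data, the restart policy, and the `./staging` directory are assumptions rather than the contents of the linked example.

```shell
#!/bin/sh
# Sketch: write a minimal prometheus.service into a staging directory.
# Assumptions: the data directory, restart policy, and ./staging path.
set -eu

STAGE_DIR="${STAGE_DIR:-./staging}"
mkdir -p "$STAGE_DIR"

cat > "$STAGE_DIR/prometheus.service" <<'EOF'
[Unit]
Description=Prometheus
After=network-online.target
Wants=network-online.target

[Service]
User=prometheus
ExecStart=/usr/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/data
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

echo "Wrote $STAGE_DIR/prometheus.service"
```

Setting `--storage.tsdb.path` explicitly matters here because Prometheus otherwise writes to a `data/` directory relative to its working directory.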

Do a daemon-reload and start the Prometheus service.

systemctl daemon-reload  
systemctl start prometheus

Make sure prometheus is running.

systemctl status prometheus  
ss -ntulp | grep 9090

Default port for Prometheus is 9090.
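
With several services installed, checking each port by hand gets repetitive. The helper below is a sketch that reads `ss -ntl` output and reports on the default ports listed in these notes. `check_stack_ports` is a hypothetical name, and the demo feeds it canned input so it runs without any of the services present.

```shell
#!/bin/sh
# Sketch: report which services from these notes have a listening TCP port.
# check_stack_ports is a hypothetical helper. It reads `ss -ntl` output on
# stdin so it can be exercised without the services running.
set -eu

check_stack_ports() {
    # Column 4 of `ss -ntl` is "Local Address:Port"; keep the last
    # colon-separated field so IPv6 addresses such as [::]:9090 also work.
    listening="$(awk 'NR > 1 { n = split($4, a, ":"); print a[n] }')"
    for entry in grafana:3000 loki:3100 influxdb:8086 node_exporter:9100 prometheus:9090; do
        name="${entry%%:*}"
        port="${entry##*:}"
        if printf '%s\n' "$listening" | grep -qx "$port"; then
            echo "$name $port listening"
        else
            echo "$name $port not-listening"
        fi
    done
}

# Demo with canned input; on a real host run: ss -ntl | check_stack_ports
sample='State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      4096               *:3000             *:*
LISTEN 0      4096       127.0.0.1:9090       0.0.0.0:*'
printf '%s\n' "$sample" | check_stack_ports
```

Note that the column position assumes TCP-only output (`ss -ntl`); adding `-u` introduces a leading Netid column and shifts the address to column 5.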

Grafana Provisioning: Automatic Data Source and Dashboard Provisioning¶

Grafana supports provisioning data sources and dashboards through config files.

These are YAML files that can be included in an Ansible playbook or placed manually
in the Grafana server's configuration directory.

Data source files go in /etc/grafana/provisioning/datasources and dashboard provider files go in /etc/grafana/provisioning/dashboards (assuming your Grafana
config location is /etc/grafana).

You can also provision plugins and, if you want to, run multiple Grafana instances.
See the docs.

Grafana Data Source Provisioning¶

Grafana provisioning directory: /etc/grafana/provisioning/datasources/

To provision Prometheus, you could create a prometheus.yaml in this directory:

# /etc/grafana/provisioning/datasources/prometheus.yaml  
apiVersion: 1  
datasources:  
  - name: Prometheus  
    type: prometheus  
    access: proxy  
    url: http://localhost:9090  
    isDefault: true  
    editable: false
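
The Loki data source set up by hand earlier can be provisioned the same way. The sketch below stages a loki.yaml next to the Prometheus one. The file name and the `./staging` directory are assumptions, and `isDefault` is left out because only one data source per organization can be the default.

```shell
#!/bin/sh
# Sketch: stage a provisioning file for the Loki data source.
# The loki.yaml file name and ./staging directory are assumptions.
set -eu

STAGE_DIR="${STAGE_DIR:-./staging}"
mkdir -p "$STAGE_DIR/datasources"

cat > "$STAGE_DIR/datasources/loki.yaml" <<'EOF'
# Intended for /etc/grafana/provisioning/datasources/loki.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://127.0.0.1:3100
    editable: false
EOF

echo "Wrote $STAGE_DIR/datasources/loki.yaml"
```

Grafana reads provisioning files at startup, so `sudo systemctl restart grafana-server` is needed after copying the file into place.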

Below is a more robust example from the Grafana docs:

# Configuration file version  
apiVersion: 1  

# List of data sources to delete from the database.  
deleteDatasources:  
  - name: Graphite  
    orgId: 1  

# Mark provisioned data sources for deletion if they are no longer in a provisioning file.  
# It takes no effect if data sources are already listed in the deleteDatasources section.  
prune: true  

# List of data sources to insert/update depending on what's  
# available in the database.  
datasources:  
  # <string, required> Sets the name you use to refer to  
  # the data source in panels and queries.  
  - name: Graphite  
    # <string, required> Sets the data source type.  
    type: graphite  
    # <string, required> Sets the access mode, either  
    # proxy or direct (Server or Browser in the UI).  
    # Some data sources are incompatible with any setting  
    # but proxy (Server).  
    access: proxy  
    # <int> Sets the organization id. Defaults to orgId 1.  
    orgId: 1  
    # <string> Sets a custom UID to reference this  
    # data source in other parts of the configuration.  
    # If not specified, Grafana generates one.  
    uid: my_unique_uid  
    # <string> Sets the data source's URL, including the port.  
    url: http://localhost:8080  
    # <string> Sets the database user, if necessary.  
    user:  
    # <string> Sets the database name, if necessary.  
    database:  
    # <bool> Enables basic authorization.  
    basicAuth:  
    # <string> Sets the basic authorization username.  
    basicAuthUser:  
    # <bool> Enables credential headers.  
    withCredentials:  
    # <bool> Toggles whether the data source is pre-selected for new panels. 
    # You can set only one default data source per organization.  
    isDefault:  
    # <map> Fields to convert to JSON and store in jsonData.  
    jsonData:  
      # <string> Defines the Graphite service's version.  
      graphiteVersion: '1.1'  
      # <bool> Enables TLS authentication using a client  
      # certificate configured in secureJsonData.  
      tlsAuth: true  
      # <bool> Enables TLS authentication using a CA  
      # certificate.  
      tlsAuthWithCACert: true  
    # <map> Fields to encrypt before storing in jsonData.  
    secureJsonData:  
      # <string> Defines the CA cert, client cert, and  
      # client key for encrypted authentication.  
      tlsCACert: '...'  
      tlsClientCert: '...'  
      tlsClientKey: '...'  
      # <string> Sets the database password, if necessary.  
      password:  
      # <string> Sets the basic authorization password.  
      basicAuthPassword:  
    # <int> Sets the version. Used to compare versions when  
    # updating. Ignored when creating a new data source.  
    version: 1  
    # <bool> Allows users to edit data sources from the  
    # Grafana UI.  
    editable: false

Dashboard Provisioning¶

Create a provisioning file for dashboards, like:

/etc/grafana/provisioning/dashboards/default.yaml

This yaml file will have a path key that points to where the actual json files for the dashboards are stored.

# /etc/grafana/provisioning/dashboards/default.yaml  
apiVersion: 1  
providers:  
  - name: 'default'  
    orgId: 1  
    folder: ''  
    type: file  
    options:  
      path: /var/lib/grafana/dashboards

The dashboards themselves go in /var/lib/grafana/dashboards/ (or wherever path: points), in JSON format.
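
To try the provider out before exporting a real dashboard, a placeholder JSON file is enough. The sketch below stages a very small one; the `uid`, the title, the file name, and the `./staging` directory are made-up values, and a dashboard exported from the Grafana UI will contain many more fields than this.

```shell
#!/bin/sh
# Sketch: stage a placeholder dashboard for the file provider to pick up.
# The uid, title, file name, and ./staging directory are hypothetical.
set -eu

STAGE_DIR="${STAGE_DIR:-./staging}"
mkdir -p "$STAGE_DIR/dashboards"

cat > "$STAGE_DIR/dashboards/example.json" <<'EOF'
{
  "uid": "example-provisioned",
  "title": "Example provisioned dashboard",
  "panels": []
}
EOF

echo "Wrote $STAGE_DIR/dashboards/example.json"
```

Copying the file into /var/lib/grafana/dashboards/ should make an empty dashboard with that title appear once Grafana next scans the directory.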

More robust example from the Grafana docs:

apiVersion: 1  

providers:  
  # <string> an unique provider name. Required  
  - name: 'a unique provider name'  
    # <int> Org id. Default to 1  
    orgId: 1  
    # <string> name of the dashboard folder.  
    folder: ''  
    # <string> folder UID. will be automatically generated if not specified  
    folderUid: ''  
    # <string> provider type. Default to 'file'  
    type: file  
    # <bool> disable dashboard deletion  
    disableDeletion: false  
    # <int> how often Grafana will scan for changed dashboards  
    updateIntervalSeconds: 10  
    # <bool> allow updating provisioned dashboards from the UI  
    allowUiUpdates: false  
    options:  
      # <string, required> path to dashboard files on disk. Required when using the 'file' type  
      path: /var/lib/grafana/dashboards  
      # <bool> use folder names from filesystem to create folders in Grafana  
      foldersFromFilesStructure: true

Changing Provisioned Dashboards¶

NOTE: Changes made to a provisioned dashboard through the UI can't be saved back
to the provisioning source (the JSON file it is read from).

If allowUiUpdates is set to true and you change a provisioned
dashboard, you can save the dashboard and the changes persist in the Grafana
database, but the original JSON file is not updated.
When the dashboard is next reloaded from the source, those changes are overwritten.