Enhancing Zeek data with AI/ML anomaly detection data



This blog is part of the VTPP (VNET Threat Perception Platform) project, a three-year programme co-funded by the European Commission under the DIGITAL-ECCC-2022-CYBER-03 call. The project covers DDoS mitigation with FastNetMon, vulnerability scanning with OpenVAS, custom AI/ML detection plugins for Zeek, HSM-backed key management, RPKI validation and a Krill CA, and a full-scale deployment of Security Onion as the IDS/SIEM/NSM backbone.

Building on the earlier, successful MINERWA project, which provided machine-learning-based detection of malicious traffic and AI-based detection of traffic anomalies, we decided to take it one step further: use these data to enrich Zeek output and work with the enriched logs in the Security Onion platform. Establishing Security Onion as the SIEM tool at VNET and using MINERWA to enrich Zeek data are two of the goals of the VTPP (VNET Threat Perception Platform) project.

VTPP builds on top of MINERWA by integrating AI/ML-enriched anomaly data directly into Zeek logs and ingesting them into Security Onion, providing a unified, AI-augmented SIEM platform for network threat detection at VNET.

Brief history of MINERWA

Between 2020 and 2023, VNET took part in the CEF Telecom EU programme with MINERWA, a network awareness and early anomaly detection system. The aim of the project was to build a system with adaptive, AI-based anomaly detection in network traffic that would help detect various kinds of potential attacks, even in encrypted traffic, based on behavioral analysis of network nodes.
As the data source for analysis we chose IPFIX flows, which provide an effective and scalable way of collecting network traffic metadata; we already had good experience with NetFlow v5 and NetFlow v9. For MINERWA, however, we captured traffic and exported IPFIX data with a dedicated tool, nProbe, which provides much more metadata and statistics than common routers, even enterprise ones.
With IPFIX data available, we process them in two stages. A lot of malicious traffic follows common, well-known patterns (ICMP scans, TCP port scans, TCP floods, etc.) that firewalls often block easily and that do little harm unless the volume is high, so we first trained machine-learning classifiers to detect these patterns. A flow that the classifier marks as malicious is reported immediately and does not proceed to anomaly detection. This saves resources that AI inference would otherwise consume, and the report names the specific type of malicious traffic, so it most probably does not need to be analyzed by a domain expert (system/network engineer). The remaining flows continue to AI anomaly detection, where autoencoder-based models trained on baseline traffic (common benign traffic) detect deviations from that baseline.
All flows, along with any detected anomalies, are stored in a ClickHouse database for further use (notifications, dashboards, Zeek log enrichment). Fig. 1 shows a simplified diagram of the MINERWA architecture relevant to Zeek log enrichment.

Fig. 1: Simplified architecture of MINERWA
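
The two-stage processing described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not MINERWA's actual code: the function names, the toy classifier rule, the toy "autoencoder" and the threshold are all assumptions made for the example, and the detector label ml_classifier is hypothetical (ai_detector is the name that appears in our logs).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Verdict:
    detector: str    # which stage produced the verdict
    event_name: str  # known pattern label, or "anomaly"
    metric: float    # classifier probability or reconstruction error


def triage(
    features: Sequence[float],
    classify: Callable[[Sequence[float]], Tuple[Optional[str], float]],
    reconstruct: Callable[[Sequence[float]], Sequence[float]],
    threshold: float,
) -> Optional[Verdict]:
    """Run the cheap classifier first; only unmatched flows reach the autoencoder."""
    label, probability = classify(features)
    if label is not None:
        # Known pattern: report immediately and skip anomaly detection.
        return Verdict("ml_classifier", label, probability)
    rebuilt = reconstruct(features)
    # Mean squared reconstruction error as the deviation from the baseline.
    error = sum((a - b) ** 2 for a, b in zip(features, rebuilt)) / len(features)
    if error > threshold:
        return Verdict("ai_detector", "anomaly", error)
    return None


def toy_classifier(features):
    # Hypothetical rule: very many packets with a tiny average size.
    packets, avg_bytes = features
    if packets > 1000 and avg_bytes < 64:
        return "tcp_flood", 0.99
    return None, 0.0


def toy_autoencoder(features):
    # Hypothetical stand-in for a trained model: always returns the baseline profile.
    return (10.0, 500.0)
```

Calling triage((5000.0, 40.0), toy_classifier, toy_autoencoder, 1000.0) stops at the first stage, while a flow such as (10.0, 900.0) passes the classifier and is flagged only because its reconstruction error exceeds the threshold.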

Security Onion platform

As described by its author, Security Onion is a free and open platform for threat hunting, network security monitoring, and log management. Among other features, it integrates free tools such as Zeek (network monitor), Suricata (intrusion detection) and the ELK stack (Elasticsearch, Logstash and Kibana for storing, processing and visualizing data) into a single platform that can be deployed as a standalone node or as a distributed system. It provides central management, configuration and monitoring of its components and lets individual parts be scaled according to performance needs (scaling Elasticsearch nodes, distributing network traffic collection, etc.). Security Onion is often used as a free alternative to commercial SIEM tools. Thanks to its openness, it can be extended in various ways. The one we chose in the VTPP project is enriching Zeek logs with AI/ML anomaly detection data from MINERWA.

Building upon success

Although we finished the MINERWA project successfully, we wanted to integrate it further with a more widely used system. MINERWA can be used as a standalone tool, but its biggest value lies in its AI detection core. We therefore decided to build on that core and use its output in a much more mature and globally used tool: Zeek / Security Onion.

Zeek

Zeek is an open-source network security monitoring (NSM) platform that translates raw network packets into structured, high-fidelity logs. Unlike traditional intrusion detection systems (IDS) that rely on static signatures to shout "Alert!", Zeek acts as a silent archivist. It watches everything passing through the wire, interprets complex protocols (like HTTP, DNS, or SSH), and organizes that data into clear, tab-separated logs. For threat hunters and forensic analysts, Zeek is the gold standard for understanding exactly what happened on a network.

Because Zeek is open source, it is highly customizable through scripts written in its own domain-specific language. Moreover, if scripting is not sufficient, a custom C++ plugin can add virtually any capability. This is the key feature that we leveraged to connect Zeek with MINERWA. Zeek’s scripting extensibility is organized into what the project calls frameworks, each covering a specific part of data processing (File Analysis Framework, Storage Framework, Input Framework, etc.).

Zeek’s main focus is processing network traffic. Its main input is therefore either a network interface on which Zeek captures traffic or pre-captured traffic fed to it from a PCAP file. By default, the captured data are processed and written out as a set of tab-separated text log files that other tools can process further, e.g. by sending them to Elasticsearch for use by other Security Onion tools. However, it is often useful to do some pre-processing closer to the source and, where appropriate, change the processing flow or filter data earlier based on the findings.

New input channel of Zeek

Besides the main input stream of captured network data, Zeek provides an Input Framework for ingesting additional data from external sources such as public IP reputation lists or malicious file hashes. Captured traffic can then be compared against such lists immediately, and logs can be enriched with the matching information. Several methods of getting additional input are already built in (e.g. DNS query, text file, SQLite database); the latter two are implemented using the Input Framework. Since MINERWA uses a ClickHouse database and we did not want a text file or another SQL database merely as a middleware interface to Zeek, we decided to implement a custom third-party plugin that extends the Input Framework with ClickHouse support.

This plugin – zeek-clickhouse – is one of the key open-source deliverables of the VTPP project, available for the community to reuse and extend.

Having a custom plugin that reads data from ClickHouse saves us resources: we do not have to operate a middleware data store, which would introduce more potential points of failure. Since the plugin is written in C++ and ClickHouse is an OLAP database specialized in storing and quickly querying large volumes of data, the whole interconnection is fast and light on resources. We further optimized performance by periodically loading recent anomalies into a Zeek script table and comparing Zeek connections against this table instead of looking up each connection in ClickHouse or another data source.
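
The idea of matching connections against a periodically reloaded table, rather than querying the database per connection, can be illustrated with a minimal Python sketch. It is not how the plugin is implemented (in Zeek the Input Framework refreshes the table in the background); here the class name, the fetch callback and the refresh-on-lookup strategy are assumptions chosen to keep the example short.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, str]  # (source IP, destination IP)


class Watchlist:
    """In-memory copy of recent anomalies, reloaded at most once per interval."""

    def __init__(
        self,
        fetch: Callable[[], Dict[Key, dict]],
        interval: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch        # hypothetical callback that runs one bulk query
        self._interval = interval  # seconds between reloads
        self._clock = clock
        self._table: Dict[Key, dict] = {}
        self._loaded_at: Optional[float] = None

    def lookup(self, src: str, dst: str) -> Optional[dict]:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._interval:
            # One bulk reload replaces many per-connection queries.
            self._table = self._fetch()
            self._loaded_at = now
        return self._table.get((src, dst))
```

With a 5-second interval, thousands of lookups per second still cause only one database query every 5 seconds; the price is that an anomaly may be up to one interval old before connections start matching it.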

Zeek & Security Onion deployment

There are several ways of deploying and running both Zeek and Security Onion, depending on the needs of a particular environment. Both have two basic modes: standalone and distributed. In both cases, standalone mode is mostly suitable only for testing or very simple use cases, while the distributed setup allows individual components to be scaled according to architectural or performance needs. How we run Security Onion is described in a separate article.

Zeek on Security Onion runs in standalone mode, because the sensor nodes have bonded network interfaces. One limitation, and the reason that prevents a standalone setup on our own MINERWA server, is that a single instance of Zeek can monitor only a single network interface. To monitor multiple interfaces, one must therefore either run a separate Zeek instance for each interface or set up a virtual interface that merges the traffic. Running multiple Zeek instances might, however, bring performance benefits. Moreover, the ZeekControl tool reduces even a distributed setup to a few simple configuration files and zeekctl commands, so it is almost as easy as running a single instance, and simpler and better organized than running multiple Zeek instances directly.

ZeekControl takes care of configuring and running all configured Zeek instances, collecting and rotating logs in one place, and more.

Besides the sensor nodes managed by Security Onion, we run a separate server that hosts MINERWA and Zeek. This node is currently not managed by Security Onion, but we wanted to work with the data it processes in Security Onion Console. We therefore chose partial integration, i.e. running an unmanaged Zeek together with an Elastic Agent managed by the Fleet Server within Security Onion. The centrally managed Elastic Agent forwards Zeek logs to Logstash running on the Security Onion manager server, which processes the logs and stores them in Elasticsearch to be queried in Kibana or Security Onion Console.
On our separate MINERWA server we receive mirrored traffic via RSPAN on 4 separate interfaces (one per RX/TX direction and router). We therefore run Zeek in distributed mode with 4 workers managed by ZeekControl.
Fig. 2 shows the architecture of feeding enriched Zeek logs from the standalone server to Security Onion.

Fig. 2: Architectural overview of integrating MINERWA and VTPP
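
To give an idea of what a ZeekControl setup with one worker per interface looks like, the following Python sketch generates a node.cfg in the INI layout ZeekControl reads (one section per node with type, host and, for workers, interface). The host and the interface names eth1 to eth4 are placeholders, not our real configuration, and the helper name build_node_cfg is made up for this example.

```python
import configparser
import io
from typing import List


def build_node_cfg(host: str, interfaces: List[str]) -> str:
    """Return node.cfg text with a manager, a proxy and one worker per interface."""
    cfg = configparser.ConfigParser()
    cfg["manager"] = {"type": "manager", "host": host}
    cfg["proxy-1"] = {"type": "proxy", "host": host}
    for number, interface in enumerate(interfaces, start=1):
        # Each worker is a separate Zeek process bound to exactly one interface.
        cfg[f"worker-{number}"] = {
            "type": "worker",
            "host": host,
            "interface": interface,
        }
    out = io.StringIO()
    cfg.write(out, space_around_delimiters=False)
    return out.getvalue()


if __name__ == "__main__":
    # Placeholder interface names standing in for the four RSPAN interfaces.
    print(build_node_cfg("localhost", ["eth1", "eth2", "eth3", "eth4"]))
```

Once such a file is in place, the cluster is typically deployed and controlled through zeekctl commands rather than by starting Zeek processes by hand.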

Hands-on deployment

Setting up a standalone Zeek node with log enrichment and integrating it with an already running Security Onion infrastructure consists of the following steps:

  1. Set up base server according to company standards
  2. Install Zeek
  3. Compile and install zeek-clickhouse plugin
  4. Prepare a Zeek script for anomaly data gathering and logs enrichment
  5. Set up Elastic Agent joined to Fleet of running Security Onion
  6. Modify Elasticsearch Ingest Pipeline adding new fields

Since step 1 is out of scope of this article and steps 2 and 3 are covered by the respective documentation, let’s continue with step 4.

Here is an example of a Zeek script that collects enrichment data from ClickHouse, matches connections analyzed by Zeek against these data and enriches the connection logs:

@load policy/tuning/json-logs
@load base/protocols/conn
@load base/frameworks/input
@load-plugin VNET::ClickHouse

module VTPP;

export {
    type WatchlistIdx: record {
        src_ip: addr;
        dst_ip: addr;
    };

    type WatchlistEntry: record {
        detector: string;
        event_name: string;
        metric: double;
    };

    global watchlist: table[addr, addr] of WatchlistEntry = table();
	
    redef record Conn::Info += {
        detector: string &optional &log;
        anomaly: string &optional &log;
        anomaly_metric: double &optional &log;
    };
}

event zeek_init() {
    local ch_info = ClickHouse::Info(
        $hostname = "127.0.0.1",
        $server_port = 9000,
        $database = "default",
        $query = "SELECT mf.ipv4SrcAddr::String AS src_ip, mf.ipv4DstAddr::String AS dst_ip, any(md.detector) AS detector, any(md.event_name) AS event_name, avg(md.metric) AS metric FROM minerwa_detections md JOIN minerwa_flows mf ON md.flow_id = mf.id GROUP BY mf.ipv4SrcAddr, mf.ipv4DstAddr;",
        $poll_interval = 5sec
    );

    Input::add_table([
        $name = "watchlist",
        $source = "clickhouse://127.0.0.1",
        $reader = Input::READER_CLICKHOUSE,
        $mode = Input::REREAD,
        $destination = watchlist,
        $idx = WatchlistIdx,
        $val = WatchlistEntry,
        $config = ClickHouse::config_to_table(ch_info)
    ]);
}

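# Default priority (0): runs after the base conn script has filled in c$conn
# (priority 5) and before conn.log is written (priority -5).
event connection_state_remove(c: connection) {
    local orig = c$id$orig_h;
    local resp = c$id$resp_h;

    # Enrich the log record only if this address pair is on the watchlist.
    if ( c?$conn && [orig, resp] in watchlist ) {
        local anomaly: WatchlistEntry = watchlist[orig, resp];
        c$conn$detector = anomaly$detector;
        c$conn$anomaly = anomaly$event_name;
        c$conn$anomaly_metric = anomaly$metric;
    }
}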

At the beginning of the script, two record types (WatchlistIdx, WatchlistEntry) and a global table based on them are defined; the table serves as a key-value store for the anomaly data gathered from ClickHouse. In addition, the Conn::Info record type, which represents Zeek’s connection log, is extended with the desired enrichment fields.

Right after Zeek starts up, just before it begins processing network input, the zeek_init event is triggered and its handler is called. There we set up the connection to the ClickHouse database, defining the connection parameters, the SQL query and the mapping to the watchlist table. Columns selected in the query must be named or aliased to match the fields of the index and value record types defined at the top.

Finally, a connection_state_remove event handler is defined, in which connections are matched against the anomaly watchlist table. This event is triggered once processing of a connection is complete and the connection is about to be written to the log and removed from memory, so the enrichment happens as a final touch at the very end of connection processing.

At this point we have enriched Zeek logs that can already be used in any desired way. The first line of the script (@load policy/tuning/json-logs) switches the output to JSON, which Security Onion requires; by default Zeek uses a tab-separated log format.

The following JSON represents one line of Zeek’s conn.log (note the last three extra fields, which say that the connection was identified by ai_detector as a ping scan with a probability of 77.78%):

{"ts":1780172174.126709,"uid":"CHTpZd2dnsq0MmCrhf","id.orig_h":"47.91.114.16","id.orig_p":8,"id.resp_h":"165.231.211.10","id.resp_p":0,"proto":"icmp","conn_state":"OTH","local_orig":false,"local_resp":false,"missed_bytes":0,"orig_pkts":1,"orig_ip_bytes":50,"resp_pkts":0,"resp_ip_bytes":0,"ip_proto":1,"detector":"ai_detector","anomaly":"ping_scan_4","anomaly_metric":0.7778277397155762}
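
For a quick check outside Security Onion, an enriched conn.log line can be read with a few lines of Python. This is a minimal sketch: the sample is an abbreviated copy of the record above, and the helper name describe is made up for the example. Note that Zeek’s JSON output uses flat, dotted key names such as id.orig_h rather than nested objects.

```python
import json
from typing import Optional

ENRICHMENT_FIELDS = ("detector", "anomaly", "anomaly_metric")

# Abbreviated copy of the conn.log record shown above.
SAMPLE = (
    '{"uid":"CHTpZd2dnsq0MmCrhf","id.orig_h":"47.91.114.16",'
    '"id.resp_h":"165.231.211.10","proto":"icmp","detector":"ai_detector",'
    '"anomaly":"ping_scan_4","anomaly_metric":0.7778277397155762}'
)


def describe(line: str) -> Optional[str]:
    """Return a one-line summary for enriched records, None for plain ones."""
    record = json.loads(line)
    if not all(field in record for field in ENRICHMENT_FIELDS):
        return None  # connection was not on the watchlist
    return "{} -> {}: {} reported by {} ({:.2%})".format(
        record["id.orig_h"],
        record["id.resp_h"],
        record["anomaly"],
        record["detector"],
        record["anomaly_metric"],
    )


if __name__ == "__main__":
    print(describe(SAMPLE))
```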

Since Zeek writes its logs to files, we can use Elastic Agent to read these files and send them through Logstash to Security Onion’s Elasticsearch for further processing and analysis (see Fig. 3).
Provisioning an Elastic Agent via Kibana (Management -> Fleet) is quite easy. The first step is to prepare a new agent policy specific to the needs of the custom Zeek node. Two integrations are enough in this policy:

  1. System – default namespace, all other settings left at their defaults
  2. Custom Logs (Filestream) – namespace so; in Custom Filestream Logs set the paths to the Zeek logs (e.g. /nsm/zeek/logs/current/*.log), set Dataset name to zeek, set Exclude files to (broker|capture_loss|cluster|conn-summary|console|ecat_arp_info|known_certs|known_hosts|known_services|loaded_scripts|ntp|ocsp|packet_filter|reporter|stats|stderr|stdout)(\..+)?\.log$ and set the following custom Processors:
- dissect:
    tokenizer: "/nsm/zeek/logs/current/%{pipeline}.log"
    field: "log.file.path"
    trim_chars: ".log"
    target_prefix: ""
- script:
    lang: javascript
    source: >
      function process(event) {
        var pl = event.Get("pipeline");
        event.Put("@metadata.pipeline", "zeek." + pl);
      }
- add_fields:
    target: event
    fields:
      category: network
      module: zeek
- add_tags:
    tags: "ics"
    when:
      regexp:
        pipeline: "^bacnet*|^bsap*|^cip*|^cotp*|^dnp3*|^ecat*|^enip*|^modbus*|^opcua*|^profinet*|^s7comm*"
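
What these settings achieve together can be emulated in Python to make the routing easier to follow. The sketch below is not how Elastic Agent works internally; it only reproduces the intended outcome (excluded files are skipped, the file name selects the ingest pipeline, ICS protocols get a tag), reusing the two regular expressions from the policy. The function name route and the dictionary layout are assumptions made for the example.

```python
import re
from typing import Optional

# Same expression as the "Exclude files" setting above.
EXCLUDE = re.compile(
    r"(broker|capture_loss|cluster|conn-summary|console|ecat_arp_info|"
    r"known_certs|known_hosts|known_services|loaded_scripts|ntp|ocsp|"
    r"packet_filter|reporter|stats|stderr|stdout)(\..+)?\.log$"
)
# Same expression as the add_tags condition above.
ICS = re.compile(
    r"^bacnet*|^bsap*|^cip*|^cotp*|^dnp3*|^ecat*|^enip*|^modbus*|"
    r"^opcua*|^profinet*|^s7comm*"
)
LOG_PATH = re.compile(r"/nsm/zeek/logs/current/(?P<pipeline>.+)\.log")


def route(path: str) -> Optional[dict]:
    """Return the fields the processors would add, or None if the file is skipped."""
    if EXCLUDE.search(path):
        return None
    match = LOG_PATH.fullmatch(path)
    if match is None:
        return None
    pipeline = match.group("pipeline")
    event = {
        "pipeline": pipeline,
        "@metadata": {"pipeline": "zeek." + pipeline},
        "event": {"category": "network", "module": "zeek"},
    }
    if ICS.search(pipeline):
        event["tags"] = ["ics"]
    return event
```

For example, route("/nsm/zeek/logs/current/conn.log") selects the zeek.conn ingest pipeline, which is the pipeline that has to know about the new enrichment fields.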

This policy is then used when adding the agent. When integrating with Security Onion, this mostly means copying the relevant parts of its configuration. Once the Elastic Agent is provisioned, the ingest pipeline, specifically zeek.conn, should be extended to handle the new fields:

{
  "rename": {
    "field": "message2.detector",
    "target_field": "minerwa_detector",
    "ignore_missing": true
  }
},
{
  "rename": {
    "field": "message2.anomaly",
    "target_field": "minerwa_anomaly",
    "ignore_missing": true
  }
},
{
  "rename": {
    "field": "message2.anomaly_metric",
    "target_field": "minerwa_anomaly_metric",
    "ignore_missing": true
  }
}
Fig. 3: Dataflow and architecture diagram of Security Onion in distributed deployment. Source: docs.securityonion.net
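
The effect of the three rename processors shown before Fig. 3 can be illustrated with a small Python emulation. It is a simplified model of the rename behaviour with ignore_missing, written for this article and not taken from Elasticsearch; the sample documents are made up, with values borrowed from the conn.log example.

```python
def rename(doc: dict, field: str, target_field: str, ignore_missing: bool = False) -> dict:
    """Move doc[field] (dotted path) to doc[target_field], emulating a rename processor."""
    *parents, leaf = field.split(".")
    node = doc
    for key in parents:
        node = node.get(key) if isinstance(node, dict) else None
        if node is None:
            break
    if not isinstance(node, dict) or leaf not in node:
        if ignore_missing:
            return doc  # plain connection: nothing to rename
        raise KeyError(field)
    doc[target_field] = node.pop(leaf)
    return doc


RENAMES = [
    ("message2.detector", "minerwa_detector"),
    ("message2.anomaly", "minerwa_anomaly"),
    ("message2.anomaly_metric", "minerwa_anomaly_metric"),
]


def apply_renames(doc: dict) -> dict:
    for field, target in RENAMES:
        rename(doc, field, target, ignore_missing=True)
    return doc
```

Because most connections carry no anomaly fields, ignore_missing matters here: without it, every ordinary conn.log record would fail in the pipeline instead of passing through unchanged.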

The custom Zeek node is now integrated into Security Onion. Depending on needs and constraints, it is also possible to integrate the custom Zeek plugin and scripts into Security Onion itself, so that Security Onion distributes them (using Salt) to the sensor nodes running Zeek.

Results & Impact

With everything set up correctly, enriched Zeek logs are available in Security Onion, ready to be monitored and analyzed together with all other data in Security Onion Console. Let’s take a quick look at what this integration brought us.

Fig. 4 shows a pie chart of connections with and without a detected anomaly. Since we use only a single detector in MINERWA, all detected anomalies come from it. In this example, about 1.16% of connections in the selected time window are anomalous.

Fig. 4: Pie chart showing the ratio of anomalous connections to all traffic
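
The same kind of ratio can be computed directly from JSON conn.log lines, which is handy for cross-checking a dashboard. The following Python sketch counts records that carry the anomaly field; the function name and the four sample lines are purely illustrative and are not our data.

```python
import json
from typing import Iterable


def anomaly_share(lines: Iterable[str]) -> float:
    """Fraction of conn.log records that carry the 'anomaly' enrichment field."""
    total = 0
    flagged = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        record = json.loads(line)
        total += 1
        if "anomaly" in record:
            flagged += 1
    return flagged / total if total else 0.0


# Illustrative records only; real input would be the lines of conn.log.
EXAMPLE_LINES = [
    '{"uid":"C1","proto":"tcp"}',
    '{"uid":"C2","proto":"icmp","detector":"ai_detector","anomaly":"ping_scan_4","anomaly_metric":0.78}',
    '{"uid":"C3","proto":"udp"}',
    '{"uid":"C4","proto":"tcp"}',
]
```

With a real log the input would be an open file, for example anomaly_share(open(path)) for a path to conn.log.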

Fig. 5 lists the top sources of anomalous connections with their respective anomalies. This figure and the following one are closely related in terms of behavior analysis.

Fig. 5: List of top sources of anomalous connections

Fig. 6 shows the sources and destinations of connections joined by the type of anomaly, sorted by the number of connections from the source IP. This example shows prominent anomalous communication between IPs 192.0.2.11 and 192.0.2.10. However, the table in Fig. 5 lists this source IP in second place, and the source in first place in the table, 192.0.2.121, is sixth in the Sankey diagram. The difference is that 192.0.2.11 has fewer connections overall but communicates anomalously with only one peer, whereas 192.0.2.121 communicates with many more hosts and has shorter communication with each of them. The Sankey diagram also shows some ping scan communication from 192.0.2.121. Due to the nature of a Sankey diagram, it is not possible to visualize all combinations of source and destination IPs. Even with this number of IPs, most destination IPs are unreadable, so some filtering is necessary to get reasonable output.

Fig. 6: Sankey diagram of top sources of anomalous connections with type of anomaly and its target
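
Why the table and the Sankey diagram rank the same sources differently becomes clear when the same connections are counted once per source and once per source-destination pair. The Python sketch below uses the IP addresses from the text, but the connection counts and the peer addresses are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (source IP, destination IP)


def rank_sources(pairs: Iterable[Pair]) -> List[Tuple[str, int]]:
    """Ranking as in the table: total anomalous connections per source."""
    return Counter(src for src, _ in pairs).most_common()


def rank_pairs(pairs: Iterable[Pair]) -> List[Tuple[Pair, int]]:
    """Ranking as in the Sankey diagram: connections per source-destination pair."""
    return Counter(pairs).most_common()


# Invented counts: one source spreads over six peers, the other sticks to one.
CONNECTIONS: List[Pair] = []
for peer in range(1, 7):
    CONNECTIONS += [("192.0.2.121", f"198.51.100.{peer}")] * 2
CONNECTIONS += [("192.0.2.11", "192.0.2.10")] * 8
```

Here 192.0.2.121 leads the per-source ranking with 12 connections, yet the single pair 192.0.2.11 to 192.0.2.10 dominates the per-pair ranking with 8, because the other source's traffic is split into six thin flows.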

The anomalies detected in the previous figures do not necessarily mean malicious traffic. Much depends on the quality of the trained model used for detection, but in general an anomaly is just a deviation from some baseline traffic. It may be connections from an unusual location (e.g. connections from China to a purely Slovak e-shop), but also just an uncommon administration task (an excessive data migration or other communication not occurring on a regular basis). An anomaly report is simply a signal for the operator to examine the situation further.

As for performance, a custom Zeek node that provides enriched logs to Security Onion does not appear to bring any penalty compared with Security Onion sensor nodes in terms of the amount of logs provided; the custom node is basically on par with the sensor nodes, as seen in Fig. 7.

Fig. 7: Time-series chart of amount of logs ingested from particular Zeek nodes

The integration of MINERWA’s AI/ML anomaly detection with Zeek and Security Onion delivers a richer and more actionable security monitoring pipeline:

  • AI-enriched Zeek logs — every Zeek connection record can now carry MINERWA anomaly scores and ML classification labels, giving analysts immediate context without switching tools.
  • Unified SIEM view — enriched logs are fully indexed in Elasticsearch and visible in Security Onion Console and Kibana dashboards alongside Suricata alerts and other data sources.
  • Open-source contribution — the zeek-clickhouse plugin is publicly available, enabling other organizations running ClickHouse-backed analytics to integrate with Zeek.
  • Low operational overhead — the direct C++ plugin eliminates the need for a middleware data store, keeping the pipeline lean and performant even at high traffic volumes.

Conclusion

VTPP demonstrates that purpose-built AI/ML detection systems like MINERWA can be effectively integrated into established open-source security platforms without sacrificing performance or operational simplicity. By contributing the zeek-clickhouse plugin back to the community, we hope to lower the barrier for similar integrations elsewhere.

Next steps include further tuning of anomaly thresholds based on operational feedback, extending dashboard coverage in Security Onion Console, and exploring additional enrichment sources beyond MINERWA.


Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Cybersecurity Competence Centre. Neither the European Union nor the granting authority can be held responsible for them.

