At the first annual O'Reilly Security conference, Groovescale Principal Jeff Henrikson presented "Anomaly Detection: A Cybersecurity Streaming Data Pipeline Using Kafka and Akka Clustering".

Video will be posted once the conference has released it.

Anomaly Detection at Scale

  • 1. ANOMALY DETECTION AT SCALE: A CYBERSECURITY STREAMING DATA PIPELINE USING KAFKA AND AKKA CLUSTERING O'Reilly Security Conference NYC, November 2, 2016 Jeff Henrikson Groovescale http://www.groovescale.com
  • 2. OUTLINE framing problem statement streaming tech concepts outline of solution architecture, learnings
  • 3. FRAMING
  • 4. Why build predictive models? Models continue to do useful work after humans are not looking Models are based on assumptions Only humans can make assumptions
  • 5. INTRUSION DETECTION 1) Log Data 2) Configure rules 3) Human awareness examines alarms and logs 4) Quick action taken (e.g. deauthorize) 5) Re-authorize once human awareness deems longer-term mitigation is adequate Sometimes for high-confidence rules we allow 2) to trigger 4) without human intervention
  • 6. HOW CAN A SKILLED PERSON'S AWARENESS BE MORE EFFECTIVELY GUIDED? 1) Matching of network behavior against localized rules 2) Predictive modeling of the aggregate network behavior
  • 7. HOW CAN A SKILLED PERSON'S AWARENESS BE MORE EFFECTIVELY GUIDED? 1) Matching of network behavior against localized rules 2) Predictive modeling of the aggregate network behavior Hypothesis: Let's see if 2 is better.
  • 8. AI Artificial Intelligence "IA" Intelligence Augmented From Building practical AI systems Adam Cheyer, (Siri, Sentient, and Viv Labs) Strata 2016
  • 9. INTRUSION DETECTION TOOLS AS "INTELLIGENCE AUGMENTED" Intruders are trying to evade detection. Let's not worry about making the human protector of the network go away. Probably not possible given evasive response.
  • 12. CAPTURE SERVER dumpcap (from Wireshark)
  • 13. NETFLOW (V5) BASICS Attributes: Source/Destination IP, Source/Destination Port, Input interface. Metrics: Number of Packets, Sum of Bytes, Start Time, End Time. IPv4 only https://nsrc.org/workshops/2015/sanog25-nmm-tutorial/materials/netflow.pdf
  • 14. Functional Requirements Produce netflow from PCAP Score netflow for anomalies Control the number of anomalous events brought to the human expert's attention
  • 15. Nonfunctional Requirements Process line rate 10Gb/s Be within 2x perf of tcpdump Be within 4x of netflow latency Do not add single points of failure
  • 19. EXTERNAL DESIGN System coupling: Do not prescribe deploying kafka upstream or downstream (Which Kafka version? Which language binding?) External APIs: Ingress HTTP POST octet encoding Egress HTTP GET Long Polling
  • 21. INTERNAL DESIGN Record state only in: Kafka Pcap temporary files on local fs Need to write block id to EFH and dedupe for sums to be correct in the presence of retries Prefer late delivery to dropping data Prefer reading capture time in data stream to wall clock time
  • 22. Akka-cluster in one slide: Framework for Actor-based concurrency Program in Scala or Java Akka-cluster more general than map reduce, data pipelines Makes use of local and remote resources work the same
  • 23. MINIMUM VIABLE PREDICTIVE MODEL 1) Take Netflow metrics: sum(bytes), sum(packets), count 2) For each metric, compute mean and variance 3) Emit an "anomaly" when signal exceeds (mean + 3.0*sqrt(variance)) Meets minimum requirement: controls the number of events brought to the human expert's attention
  • 24. EXERCISE FOR THE READER Model for periodicity: Ihler et al, Adaptive Event Detection with Time–Varying Poisson Processes, ACM SIGKDD 2006 http://www.datalab.uci.edu/papers/event_detection_kdd06.pdf
  • 25. Symmetrical mapping of docker containers to hosts: DEPLOYMENT
  • 26. RESULTS Qualitatively, users can find relevant Anomalies in a reasonable sized stream System operates reliably Numbers are correct within assumptions
  • 28. SO WHY KAFKA VS ANY OTHER STREAMING COMPONENT? https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/comment-page-1/
  • 30. STREAMING DATA LITERATURE: A data entity is created by one module, is passed from module to module until it is no longer needed and is then destroyed. . . . Punched card accounting systems exemplify this environment. J. P. Morrison, "Data Stream Linkage Mechanism", IBM Systems Journal, 1978. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=45DED06EC91474F5938A9E05CC3D5A61? doi=
  • 31. BIND ARCHITECTURAL COUPLINGS EARLY SO THAT ARCHITECTURAL COMPONENTS CAN BE CHOSEN WITH AMPLE EVIDENCE Examples of components: which database which streaming engine Examples of couplings: format of data (e.g. newline delimited json) how to notify how to checkpoint
  • 32. HTTP COUPLING: WINS Win #1: Can't get access to pcap over API Win #2: Only RHEL-distributed reqs (perl-core, curl) required for ingress Win #3: Upgrade kafka when improved
  • 33. HTTP COUPLING: WIN #3: UPGRADE WHEN READY Kafka Version 0.9.0 Partition by Hash x x x Write timestamp to message x x Read seek by timestamp x
  • 34. LEARNING #1 https://github.com/akka/reactive-kafka Using this library in place of KafkaConsumer
  • 35. LEARNING #2, HIDING IN PLAIN SIGHT http://www.reactive-streams.org/
  • 36. FAVOR INTEGRATION TESTING OVER UNIT TESTING Ingress, egress have optional flag placebo={true,false}. Default to true. Every deployment simulates low volume placebo sinks, sources. Transmit heartbeats when each component is sure to have made forward progress.
  • 37. ON EVALUATING FAULT TOLERANCE AND SCALABILITY My smart buddy; LinkedIn runs it in production; The NSA. Can we do better?
  • 38. ON EVALUATING FAULT TOLERANCE AND SCALABILITY: The idea: Create linked containers for app Use tc to tell netfilter to drop and/or delay packets Run simulated data source
  • 39. ON EVALUATING FAULT TOLERANCE AND SCALABILITY:
      Hands on create container:
        docker run -it --rm ubuntu:14.04.2 bash
      Hands on with the container:
        root@07e330775e98:/# apt-get update && apt-get install -y ethtool
        root@07e330775e98:/# ethtool -S eth0
        NIC statistics:
             peer_ifindex: 875
      Hands on with the host (docker-machine's boot2docker has tc built-in):
        dev=$(ip link | grep '^875:')
        tc qdisc change dev $dev root netem delay 100ms 20ms distribution normal
        tc qdisc change dev eth0 root netem loss 0.1%
  • 40. Myth: Code should always go into docker containers through an image
  • 41. Myth: Code should always go into docker containers through an image Alternative: docker run -v $dirSrc:$dirSrc (to convey source code); docker exec (to restart program)
  • 42. Myth: A docker image is something that came from a Dockerfile:
  • 43. Myth: A docker image is something that came from a Dockerfile: Alternative: docker run; ansible-playbook -c local; docker commit
  • 44. ACKNOWLEDGEMENTS Ilya Levner Gunjan Gupta, Lightsphere AI Trey Blalock, Firewall Consulting
  • 45. RECOMMENDED READING
      I Heart Logs, Jay Kreps (creator of Kafka), https://www.amazon.com/Heart-Logs-Stream-Processing-Integration/dp/1491909382
      Akka in Action, Roestenburg et al. (released Sept 30, 2016), https://www.amazon.com/Akka-Action-Raymond-Roestenburg/dp/1617291013
      Scala for the Impatient, 1e, Cay Horstmann (second edition coming December 2016), https://www.amazon.com/Scala-Impatient-Cay-S-Horstmann/dp/0321774094
  • 46. READINGS ON LOW LATENCY DATA ENGINEERING (ORGANIZED BY COMMUNITY)
      Community: Title, URL
      Reactive: The Reactive Manifesto, http://www.reactivemanifesto.org/
      Reactive: Reactive Streams, http://www.reactive-streams.org/
      Kafka: I Heart Logs, Jay Kreps, 2014, https://www.amazon.com/Heart-Logs-Stream-Processing-Integration/dp/1491909382
      Kafka: Kafka: The Definitive Guide, prerelease/2017, https://www.amazon.com/Kafka-Definitive-Real-time-stream-processing/dp/1491936169
      NiFi: The core concepts of NiFi, http://nifi.apache.org/docs/nifi-docs/html/overview.html#the-core-concepts-of-nifi
      Flow Based Programming: Flow-Based Programming, J. Paul Morrison, 2010, https://www.amazon.com/Flow-Based-Programming-2nd-Application-Development/dp/1451542321
      Storm: Big Data, Nathan Marz, 2015, https://www.amazon.com/Big-Data-Principles-practices-scalable/dp/1617290343
  • 47. QUESTIONS?
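
The minimum viable predictive model on slide 23 (flag a metric when it exceeds mean + 3.0*sqrt(variance)) can be sketched in a few lines. The Python below is an illustration only, not the talk's code, which was built on Scala and Akka. `RunningStats`, `is_anomaly`, `score_stream`, and the `min_samples` warm-up are hypothetical names and parameters, and Welford's online update is swapped in here as one common way to keep a running mean and variance over a stream; the slides do not say how the talk's system computes them.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running mean and variance of one metric (Welford's online algorithm)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; reported as 0.0 until two samples are seen.
        return self.m2 / self.n if self.n > 1 else 0.0


def is_anomaly(stats: RunningStats, x: float, k: float = 3.0,
               min_samples: int = 10) -> bool:
    """True when x exceeds mean + k * sqrt(variance), as on slide 23."""
    if stats.n < min_samples:  # hypothetical warm-up, not from the slides
        return False
    return x > stats.mean + k * math.sqrt(stats.variance)


def score_stream(values, k: float = 3.0, min_samples: int = 10):
    """Return the indices of values flagged as anomalies."""
    stats = RunningStats()
    flagged = []
    for i, x in enumerate(values):
        # Score against history first, then fold the value into the history.
        if is_anomaly(stats, x, k, min_samples):
            flagged.append(i)
        stats.update(x)
    return flagged
```

One `RunningStats` per metric (sum of bytes, sum of packets, count) would be kept in practice. Raising `k` is the knob that controls how many events reach the human expert, which is the requirement the slide calls out.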


Network packet broker hardware is one way to acquire network monitoring data at scale for on-premises intrusion detection. Deployment of this kind of hardware is easy to understand. However, the result is a highly concentrated network capture source. Thus, the next challenge in developing an intrusion detection system becomes finding the tiny amount of relevant information in a very large stream, and doing so efficiently.

Jeff Henrikson presents a data pipeline for digesting useful analytics for intrusion detection from aggregated PCAP, with an emphasis on its highest throughput stage: conversion of PCAP to a netflow-like format. The main building blocks for the system are libpcap, Kafka, Scala, Akka, and Docker. The pipeline runs efficiently at 10 Gb/s with end-to-end latency of two minutes and processes streams without approximation. Any individual node can be removed from the system without disruption. Jeff shows how the upfront design compared to the final design and shares experience with the building blocks that the team discovered along the way.
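
The PCAP-to-netflow conversion described above amounts to grouping packets by the attributes listed on slide 13 and accumulating the metrics listed there. A rough Python sketch under stated assumptions: `FlowKey`, `Packet`, `FlowMetrics`, `FlowAggregator`, and the string block ids are hypothetical names, the key uses only the attributes the slide lists (real NetFlow v5 records carry more fields), and the in-memory set of seen block ids is just one possible reading of slide 21's note about deduplicating by block id so that sums stay correct under retries. The talk's system keeps its state in Kafka and temporary pcap files, not in process memory.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, NamedTuple, Set


class FlowKey(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    input_if: int


class Packet(NamedTuple):
    key: FlowKey
    ts: float    # capture time taken from the data stream, in seconds
    length: int  # packet size in bytes


@dataclass
class FlowMetrics:
    packets: int
    byte_count: int
    start: float
    end: float


class FlowAggregator:
    """Folds blocks of packets into per-flow metrics, each block at most once."""

    def __init__(self) -> None:
        self.flows: Dict[FlowKey, FlowMetrics] = {}
        self.seen_blocks: Set[str] = set()

    def add_block(self, block_id: str, packets: Iterable[Packet]) -> bool:
        """Apply one block; return False if this block id was already applied."""
        if block_id in self.seen_blocks:
            return False  # a retry of a block that was already counted
        for p in packets:
            m = self.flows.get(p.key)
            if m is None:
                self.flows[p.key] = FlowMetrics(1, p.length, p.ts, p.ts)
            else:
                m.packets += 1
                m.byte_count += p.length
                m.start = min(m.start, p.ts)
                m.end = max(m.end, p.ts)
        self.seen_blocks.add(block_id)
        return True
```

Start and end times come from each packet's capture timestamp, not the wall clock, which follows the preference stated on slide 21. Because a repeated block id is ignored, a sender can safely retry a block after a timeout without inflating the packet and byte sums.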

See also the conference's page for the talk: O'Reilly Security 2016