AutomataZoo: a Benchmark Suite for Automata Processing
AutomataZoo
High performance automata processing engines are traditionally evaluated against a limited set of regular expression rulesets. These serve as valid, real-World example use cases, but they only represent a small proportion of all automata-based applications. With the recent availability of architectures and software frameworks for automata processing, many new applications have been discovered that benefit from automata processing. These demonstrate a broad variety of characteristics that differ from prior regular expression-based applications and warrant their own benchmarks.
We released an earlier benchmark suite, ANMLZoo, in 2016 to fulfill this need. We have improved upon ANMLZoo with AutomataZoo [Wadden et al., IISWC 2018] in the following ways:
– The suite of benchmarks is no longer standardized to a particular architecture, and does not inherit the same architectural biases as ANMLZoo.
– The benchmarks implement full kernels, which allows for comparisons between automata and non-automata approaches.
– The suite includes open-source tools for generating benchmark automata and inputs of various sizes, allowing for design space explorations.
BENCHMARKS:
– Snort: A widely used network intrusion detection system.
– ClamAV: A virus-detection tool that relies on a publicly-available database of malware patterns.
– Protomata: An automata-based application that searches for a set of 1309 protein motif patterns from the PROSITE database.
– Brill Tagging: A rule-based approach to part-of-speech tagging.
– Random Forest: A machine learning model based on ensembles of decision trees.
– Hamming Distance: A string-scoring kernel that accepts inputs that are within a set hamming distance of a configured pattern.
– Levenshtein Distance: A string-scoring kernel that accepts inputs that are within a set edit distance of a configured pattern.
– Sequence Matching: An automata application that counts sorted sequences of item sets to identify frequently-occurring sets.
– Entity Resolution: An automata application that attempts to find duplicate entries in a streaming database.
– CRISPR/Cas9: An automata application that enabled gene editing by identifying targeted locations.
– YARA: An automata application that discovers malware described in the YARA malware pattern description language.
– File Carving: An automata application that identifies files in a stream of input bytes.
– Pseudo Random Number Generation (PRNG): An automata application that models Markov Chains with finite automata to generate high-throughput PRNG streams.
The benchmark suite can be found here: https://github.com/hplp/AutomataZoo
[Wadden et al., IISWC 2018] Wadden, J., Tracy II, T., Sadredini, E., Wu, L., Bo, C., Du, J., Wei, Y., Wallace, M., Udall, J., Stan, M., and Skadron, K. “ANMLZoo: A Benchmark Suite for Exploring Bottlenecks in Automata Processing Engines and Architectures.” 2018 IEEE International Symposium on Workload Characterization (IISWC’18). IEEE, 2018.
We hope this new benchmark suite will fulfill your current and future research efforts related to automata computing.