
Countish: a library for approximate frequency counting


tl;dr: get countish now at github.com/shanemhansen/countish.

Background

So every once in a while, I get asked a simple question that leaves me scratching my head.

In this case, two years ago someone asked me: "How can I find the most popular URLs?" The naive algorithm is pretty simple: keep a map of counters, and when you see a URL, increment the appropriate counter.

counts := make(map[string]int)
for _, url := range urls {
    counts[url]++
}

But this algorithm isn't completely satisfactory. For one thing, it uses O(n) memory, where n is the number of distinct URLs. For low-cardinality sets this is a great algorithm, but if you're counting something that may contain a ton of different values (such as URLs) you could end up using a ton of memory. It seems pretty obvious that if you want to count things you have to store them, and if you want to precisely count N distinct things you need at least N counters. I was aware of Bloom filters and counting filters. These data structures allow for very space-efficient ways of estimating counts, but they don't retain information about the original keys. They let you ask questions like "How many hits did the home page get?" but not "List all pages that exceeded 1% of traffic."
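To make the contrast concrete, here's what the exact approach looks like when you want the "list everything above a threshold" style of query. This is just a sketch of the naive method in Go; the report helper and example data are mine, not countish code:

package main

import "fmt"

// report returns every key whose observed frequency meets or exceeds
// the given threshold (e.g. 0.01 for "1% of traffic"). The catch:
// the counts map needs one counter per distinct key, i.e. O(n) memory.
func report(counts map[string]int, total int, threshold float64) []string {
	var hot []string
	for key, count := range counts {
		if float64(count) >= threshold*float64(total) {
			hot = append(hot, key)
		}
	}
	return hot
}

func main() {
	urls := []string{"/", "/about", "/", "/", "/contact"}
	counts := make(map[string]int)
	for _, url := range urls {
		counts[url]++
	}
	fmt.Println(report(counts, len(urls), 0.5)) // prints [/]
}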

Fast forward a couple of years, and I'm running a service in production that handles a big firehose of high-cardinality data. People keep asking me questions about heavy hitters and realtime popularity info. I know how to answer these questions with a bunch of machines running map-reduce, but I'd like to answer them in realtime without worrying about memory exhaustion. My servers have more important uses for their RAM.

Thanks to the magic of Google, I recently discovered a fantastic paper: Approximate Frequency Counts over Data Streams, by Manku and Motwani. In this paper they discuss methods of approximate frequency counting that seem to offer bounded memory for the distributions likely to be seen in the real world, and logarithmic worst-case space complexity.

I can work with logarithmic. Logarithmic means that I might use another X MB of RAM if my server runs for another month, and another X MB of RAM if the uptime stretches to a year. (For lossy counting, the paper bounds the number of tracked entries at (1/ε) log(εN), where ε is the error tolerance and N is the number of items seen.)

The paper discusses two algorithms: sticky sampling and lossy counting. I decided to implement both in Go and see how well they perform. (Disclaimer: my implementation is very unoptimized; the original paper mentions a trie and an mmap'd buffer.) Rough sketches of the two algorithms follow.
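First, lossy counting. This sketch follows the paper's description (bucket width 1/ε, prune at bucket boundaries), but the type and method names are mine and a plain map stands in for the paper's fancier data structures; don't mistake it for the actual countish code:

package main

import (
	"fmt"
	"math"
)

// entry holds an item's observed count f and delta, the maximum
// amount the true count could exceed f by.
type entry struct {
	f, delta int
}

// LossyCounter sketches lossy counting from Manku & Motwani's
// "Approximate Frequency Counts over Data Streams".
type LossyCounter struct {
	epsilon float64 // error tolerance
	n       int     // items seen so far
	width   int     // bucket width = ceil(1/epsilon)
	items   map[string]*entry
}

func NewLossyCounter(epsilon float64) *LossyCounter {
	return &LossyCounter{
		epsilon: epsilon,
		width:   int(math.Ceil(1 / epsilon)),
		items:   make(map[string]*entry),
	}
}

func (lc *LossyCounter) Observe(key string) {
	lc.n++
	bucket := int(math.Ceil(float64(lc.n) / float64(lc.width)))
	if e, ok := lc.items[key]; ok {
		e.f++
	} else {
		lc.items[key] = &entry{f: 1, delta: bucket - 1}
	}
	// At each bucket boundary, evict entries whose count can't
	// possibly matter anymore: f + delta <= current bucket id.
	if lc.n%lc.width == 0 {
		for k, e := range lc.items {
			if e.f+e.delta <= bucket {
				delete(lc.items, k)
			}
		}
	}
}

// Report returns keys whose true frequency may meet the support
// threshold s: every key with f >= (s - epsilon) * n.
func (lc *LossyCounter) Report(s float64) []string {
	var out []string
	for k, e := range lc.items {
		if float64(e.f) >= (s-lc.epsilon)*float64(lc.n) {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	lc := NewLossyCounter(0.001)
	for i := 0; i < 100000; i++ {
		lc.Observe("hot.example.com/")
		if i%10 == 0 {
			lc.Observe(fmt.Sprintf("cold-%d.example.com/", i))
		}
	}
	fmt.Println(lc.Report(0.005)) // only the hot URL survives
}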

If you want proper explanations and visualizations of how these algorithms work, check out Michael Vogiatzis's excellent post: Frequency Counting Algorithms over Data Streams. The rest of this post uses my new countish library and examines the results in terms of performance, accuracy, and memory usage.
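For completeness, here's an equally rough sketch of sticky sampling. Again, the shape follows the paper (sample new keys at rate 1/r, double r on a schedule, thin counts with coin flips when the rate changes), but the names and demo are mine, not countish's:

package main

import (
	"fmt"
	"math"
	"math/rand"
)

// StickySampler sketches sticky sampling: s = support, epsilon =
// error tolerance, delta = probability of failure.
type StickySampler struct {
	s, epsilon float64
	t          float64 // (1/epsilon) * ln(1/(s*delta))
	rate       int     // current sampling rate r
	n          int     // elements seen so far
	counts     map[string]int
}

func NewStickySampler(s, epsilon, delta float64) *StickySampler {
	return &StickySampler{
		s:       s,
		epsilon: epsilon,
		t:       1 / epsilon * math.Log(1/(s*delta)),
		rate:    1,
		counts:  make(map[string]int),
	}
}

func (ss *StickySampler) Observe(key string) {
	// The first 2t elements are sampled at rate 1, the next 2t at
	// rate 2, the next 4t at rate 4, and so on: double the rate
	// once 2*r*t elements have been seen.
	if float64(ss.n) >= 2*float64(ss.rate)*ss.t {
		ss.rate *= 2
		// Thin existing entries: for each, toss a fair coin until
		// heads, decrementing once per tails; drop entries at zero.
		for k := range ss.counts {
			for rand.Intn(2) == 0 {
				ss.counts[k]--
				if ss.counts[k] == 0 {
					delete(ss.counts, k)
					break
				}
			}
		}
	}
	ss.n++
	if _, tracked := ss.counts[key]; tracked {
		ss.counts[key]++ // already-tracked keys always count
	} else if rand.Float64() < 1/float64(ss.rate) {
		ss.counts[key] = 1 // new keys enter with probability 1/r
	}
}

// Report returns keys whose count clears (s - epsilon) * n.
func (ss *StickySampler) Report() []string {
	var out []string
	for k, c := range ss.counts {
		if float64(c) >= (ss.s-ss.epsilon)*float64(ss.n) {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	ss := NewStickySampler(0.005, 0.001, 0.01)
	for i := 0; i < 100000; i++ {
		ss.Observe("hot.example.com/")
		ss.Observe(fmt.Sprintf("cold-%d.example.com/", i))
	}
	fmt.Println(ss.Report()) // almost certainly just the hot URL
}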

Show me the money ^H^H^H^H performance!

I've implemented these algorithms in Go in the countish repo. Let's see how they perform. First we're going to need some data. Of course, searching for good test data ended up delaying this blog post by a couple of hours as I scoured the internet for the infamous "Star Wars Kid" video Apache logs.

After failing at that, I finally stumbled onto a random log-sharing website with a nice sample of web server logs. This isn't what I'd call "big data," but it's a useful dataset. I'm going to examine some logs from a Blue Coat proxy.

test -f bluecoat_proxy_big.zip || wget http://log-sharing.dreamhosters.com/bluecoat_proxy_big.zip
if ! test -f Demo_log_001.log ; then
    unzip bluecoat_proxy_big.zip
fi
head -5 Demo*log | grep GET | awk '{print $11$12}'
www.yahoo.com/
www.inmobus.com/wcm/assets/images/imagefileicon.gif
images.netmechanic.com/images/webtools/webmaster_tools.gif
www.ositis.com/tests/testconnectivity.asp

Let's find the top requests made. This dataset is unusual in that no URL makes up much more than 0.7% of traffic, so let's pull out every URL above a 0.5% threshold using an exact method. We'll use the exact "naive" countish implementation.

go get github.com/shanemhansen/countish/cmd/countish
cat Demo*log | grep --text -i GET | awk '{print $11$12}' | /usr/bin/time countish -impl naive -threshold .005 2>&1
0.007344 energydata.aws.com//WxDataISAPI/WxDataISAPI.dll
0.006139 vm.boldcenter.com/aid/5707504118312057803/bc.vm
0.006537 energydata.aws.com//ForecastISAPI/ForecastISAPI.dll
0.006692 rad.msn.com/ADSAdClient31.dll
3.46user 0.31system 0:07.93elapsed 47%CPU (0avgtext+0avgdata 273716maxresident)k
1368inputs+0outputs (203major+52001minor)pagefaults 0swaps

Looks like peak memory usage is about 267 MB (273716 KB max resident) and the user time is about 3.5s on my machine. Let's compare those results to the sticky sampling implementation, using an error tolerance of .001 (0.1%).

cat Demo*log | grep --text -i GET | awk '{print $11$12}' | /usr/bin/time countish -error-tolerance .001 -impl sticky -threshold .005 2>&1 | awk '$1>.005 {print $0}'
0.006792 rad.msn.com/ADSAdClient31.dll
0.006236 vm.boldcenter.com/aid/5707504118312057803/bc.vm
0.007443 energydata.aws.com//WxDataISAPI/WxDataISAPI.dll
0.006628 energydata.aws.com//ForecastISAPI/ForecastISAPI.dll
3.04user 0.18system 0:07.69elapsed 42%CPU (0avgtext+0avgdata 12044maxresident)k
0inputs+0outputs (1major+3754minor)pagefaults 0swaps

The first thing we notice is that the user time drops from 3.46s to 3.04s, about 12% faster. The memory usage is 12 MB, a more than 20x reduction! The results take a little interpretation. You'll see that energydata.aws.com, rad.msn.com, and vm.boldcenter.com are all still included. You'll also notice that I'm post-processing with awk to extract values whose estimated frequency is > 0.5%: an approximate counter can emit items whose true frequency falls slightly below the threshold (down to the threshold minus the error tolerance), so a final filter on the estimates is useful.

Let's compare to lossy counting. Note: I'm post-processing in the same way, keeping results with an estimated frequency above .005.

cat Demo*log | grep --text -i GET | awk '{print $11$12}' | /usr/bin/time countish -impl lossy -threshold .005 2>&1 | awk '$1>.005 {print $0}'
0.006637 energydata.aws.com//ForecastISAPI/ForecastISAPI.dll
0.006792 rad.msn.com/ADSAdClient31.dll
0.006239 vm.boldcenter.com/aid/5707504118312057803/bc.vm
0.007444 energydata.aws.com//WxDataISAPI/WxDataISAPI.dll
3.97user 0.21system 0:07.80elapsed 53%CPU (0avgtext+0avgdata 7744maxresident)k
0inputs+0outputs (2major+2110minor)pagefaults 0swaps

Lossy counting has a slightly longer runtime, but its 7.7 MB peak is roughly a third less memory than sticky sampling used, and more than a 35x reduction from the exact method, while still returning very accurate results!
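As an aside, if you'd rather embed this in a server than shell out to a CLI, the wiring might look something like the toy below. It reuses the LossyCounter sketch from earlier in this post; the routes and the mutex are my own invention, and this is emphatically not countish's real API (check the repo for that):

// Assumes the LossyCounter sketch from earlier is in scope.
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync"
)

var (
	mu   sync.Mutex // handlers run concurrently; guard the counter
	hits = NewLossyCounter(0.001)
)

func main() {
	http.HandleFunc("/heavy-hitters", func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		for _, url := range hits.Report(0.005) {
			fmt.Fprintln(w, url)
		}
	})
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		hits.Observe(r.URL.Path) // bounded memory, no per-URL counter map
		mu.Unlock()
		fmt.Fprintln(w, "ok")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}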

Conclusions

Empirically, approximate counting seems to be a huge win. Sticky sampling offers greatly reduced memory usage and increased performance on high-cardinality sets without measurably degrading results, and lossy counting squeezes memory down further still. This sort of algorithm is ideal for building a Google Analytics-style realtime experience, or for integrating into your favorite multitenant stream processor to report on failing URLs.

