
Countish: a library for approximate frequency counting


tl;dr: get countish now at github.com/shanemhansen/countish.

Background

So every once in a while, I get asked a simple question that leaves me scratching my head.

In this case, two years ago someone asked me: "How can I find the most popular URLs?" The naive algorithm is pretty simple: keep a map of counters, and when you see a URL, increment the appropriate counter.

counts := make(map[string]int)
for _, url := range urls {
    counts[url]++
}

But this algorithm isn't completely satisfactory. For one thing, it uses O(n) memory, where n is the number of distinct URLs. For low-cardinality sets this is a great algorithm, but if you're counting something that may contain a ton of different values (such as URLs) you could end up using a ton of memory. It seems pretty obvious that if you want to count things you have to store them, and if you want to precisely count N distinct things you need at least N counters. I was aware of Bloom filters and counting filters. These data structures allow for very space-efficient ways of estimating counts, but they don't retain information about the original keys. They let you ask questions like "How many hits did the home page get?" but not "List all pages that exceeded 1% of traffic."
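To make the contrast concrete, here's what the exact approach looks like when you want the "list everything above a threshold" style of query. This is just a sketch of the naive method in Go; the report helper and example data are mine, not countish code:

package main

import "fmt"

// report returns every key whose observed frequency meets or exceeds
// the given threshold (e.g. 0.01 for "1% of traffic"). The catch:
// the counts map needs one counter per distinct key, i.e. O(n) memory.
func report(counts map[string]int, total int, threshold float64) []string {
	var hot []string
	for key, count := range counts {
		if float64(count) >= threshold*float64(total) {
			hot = append(hot, key)
		}
	}
	return hot
}

func main() {
	urls := []string{"/", "/about", "/", "/", "/contact"}
	counts := make(map[string]int)
	for _, url := range urls {
		counts[url]++
	}
	fmt.Println(report(counts, len(urls), 0.5)) // prints [/]
}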

Fast forward a couple of years, and I'm running a service in production that handles a big firehose of high-cardinality data. People keep asking me questions about heavy hitters and realtime popularity info. I know how to answer these questions with a bunch of machines running map-reduce, but I'd like to answer them in realtime without worrying about memory exhaustion. My servers have more important uses for their RAM.

Thanks to the magic of Google, I recently discovered a fantastic paper: Approximate Frequency Counts over Data Streams, by Manku and Motwani. In this paper they discuss methods of approximate frequency counting that seem to offer bounded memory for the distributions likely to be seen in the real world, and logarithmic worst-case space complexity.

I can work with logarithmic. Logarithmic means that I might use another X MB of RAM if my server runs for another month, and another X MB of RAM if the uptime stretches to a year. (For lossy counting, the paper bounds the number of tracked entries at (1/ε) log(εN), where ε is the error tolerance and N is the number of items seen.)

The paper discusses two algorithms: sticky sampling and lossy counting. I decided to implement both in Go and see how well they perform. (Disclaimer: my implementation is very unoptimized; the original paper mentions a trie and an mmap'd buffer.) Rough sketches of the two algorithms follow.
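First, lossy counting. This sketch follows the paper's description (bucket width 1/ε, prune at bucket boundaries), but the type and method names are mine and a plain map stands in for the paper's fancier data structures; don't mistake it for the actual countish code:

package main

import (
	"fmt"
	"math"
)

// entry holds an item's observed count f and delta, the maximum
// amount the true count could exceed f by.
type entry struct {
	f, delta int
}

// LossyCounter sketches lossy counting from Manku & Motwani's
// "Approximate Frequency Counts over Data Streams".
type LossyCounter struct {
	epsilon float64 // error tolerance
	n       int     // items seen so far
	width   int     // bucket width = ceil(1/epsilon)
	items   map[string]*entry
}

func NewLossyCounter(epsilon float64) *LossyCounter {
	return &LossyCounter{
		epsilon: epsilon,
		width:   int(math.Ceil(1 / epsilon)),
		items:   make(map[string]*entry),
	}
}

func (lc *LossyCounter) Observe(key string) {
	lc.n++
	bucket := int(math.Ceil(float64(lc.n) / float64(lc.width)))
	if e, ok := lc.items[key]; ok {
		e.f++
	} else {
		lc.items[key] = &entry{f: 1, delta: bucket - 1}
	}
	// At each bucket boundary, evict entries whose count can't
	// possibly matter anymore: f + delta <= current bucket id.
	if lc.n%lc.width == 0 {
		for k, e := range lc.items {
			if e.f+e.delta <= bucket {
				delete(lc.items, k)
			}
		}
	}
}

// Report returns keys whose true frequency may meet the support
// threshold s: every key with f >= (s - epsilon) * n.
func (lc *LossyCounter) Report(s float64) []string {
	var out []string
	for k, e := range lc.items {
		if float64(e.f) >= (s-lc.epsilon)*float64(lc.n) {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	lc := NewLossyCounter(0.001)
	for i := 0; i < 100000; i++ {
		lc.Observe("hot.example.com/")
		if i%10 == 0 {
			lc.Observe(fmt.Sprintf("cold-%d.example.com/", i))
		}
	}
	fmt.Println(lc.Report(0.005)) // only the hot URL survives
}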

If you want proper explanations and visualizations of how these algorithms work, check out Michael Vogiatzis's excellent post: Frequency Counting Algorithms over Data Streams. The rest of this post uses my new countish library and examines the results in terms of performance, accuracy, and memory usage.
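For completeness, here's an equally rough sketch of sticky sampling. Again, the shape follows the paper (sample new keys at rate 1/r, double r on a schedule, thin counts with coin flips when the rate changes), but the names and demo are mine, not countish's:

package main

import (
	"fmt"
	"math"
	"math/rand"
)

// StickySampler sketches sticky sampling: s = support, epsilon =
// error tolerance, delta = probability of failure.
type StickySampler struct {
	s, epsilon float64
	t          float64 // (1/epsilon) * ln(1/(s*delta))
	rate       int     // current sampling rate r
	n          int     // elements seen so far
	counts     map[string]int
}

func NewStickySampler(s, epsilon, delta float64) *StickySampler {
	return &StickySampler{
		s:       s,
		epsilon: epsilon,
		t:       1 / epsilon * math.Log(1/(s*delta)),
		rate:    1,
		counts:  make(map[string]int),
	}
}

func (ss *StickySampler) Observe(key string) {
	// The first 2t elements are sampled at rate 1, the next 2t at
	// rate 2, the next 4t at rate 4, and so on: double the rate
	// once 2*r*t elements have been seen.
	if float64(ss.n) >= 2*float64(ss.rate)*ss.t {
		ss.rate *= 2
		// Thin existing entries: for each, toss a fair coin until
		// heads, decrementing once per tails; drop entries at zero.
		for k := range ss.counts {
			for rand.Intn(2) == 0 {
				ss.counts[k]--
				if ss.counts[k] == 0 {
					delete(ss.counts, k)
					break
				}
			}
		}
	}
	ss.n++
	if _, tracked := ss.counts[key]; tracked {
		ss.counts[key]++ // already-tracked keys always count
	} else if rand.Float64() < 1/float64(ss.rate) {
		ss.counts[key] = 1 // new keys enter with probability 1/r
	}
}

// Report returns keys whose count clears (s - epsilon) * n.
func (ss *StickySampler) Report() []string {
	var out []string
	for k, c := range ss.counts {
		if float64(c) >= (ss.s-ss.epsilon)*float64(ss.n) {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	ss := NewStickySampler(0.005, 0.001, 0.01)
	for i := 0; i < 100000; i++ {
		ss.Observe("hot.example.com/")
		ss.Observe(fmt.Sprintf("cold-%d.example.com/", i))
	}
	fmt.Println(ss.Report()) // almost certainly just the hot URL
}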

Show me the money ^H^H^H^H performance!

I've implemented these algorithms in Go in the countish repo. Let's see how they perform. First we're going to need some data. Of course, searching for good test data ended up delaying this blog post by a couple of hours as I scoured the internet for the infamous "Star Wars Kid" video Apache logs.

After failing at that, I finally stumbled onto a random log-sharing website with a nice sample of web server logs. This isn't what I'd call "big data," but it's a useful dataset. I'm going to examine some logs from a Blue Coat proxy.

test -f bluecoat_proxy_big.zip || wget http://log-sharing.dreamhosters.com/bluecoat_proxy_big.zip
if ! test -f Demo_log_001.log ; then
    unzip bluecoat_proxy_big.zip
fi
head -5 Demo*log | grep GET | awk '{print $11$12}'
www.yahoo.com/
www.inmobus.com/wcm/assets/images/imagefileicon.gif
images.netmechanic.com/images/webtools/webmaster_tools.gif
www.ositis.com/tests/testconnectivity.asp

Let's find the top requests made. This dataset is unusual in that no URL makes up much more than 0.7% of traffic, so let's pull out every URL above a 0.5% threshold using an exact method. We'll use the exact "naive" countish implementation.

go get github.com/shanemhansen/countish/cmd/countish
cat Demo*log | grep --text -i GET | awk '{print $11$12}' | /usr/bin/time countish -impl naive -threshold .005 2>&1
0.007344 energydata.aws.com//WxDataISAPI/WxDataISAPI.dll
0.006139 vm.boldcenter.com/aid/5707504118312057803/bc.vm
0.006537 energydata.aws.com//ForecastISAPI/ForecastISAPI.dll
0.006692 rad.msn.com/ADSAdClient31.dll
3.46user 0.31system 0:07.93elapsed 47%CPU (0avgtext+0avgdata 273716maxresident)k
1368inputs+0outputs (203major+52001minor)pagefaults 0swaps

Looks like peak memory usage is about 267 MB (273716 KB max resident) and the user time is about 3.5s on my machine. Let's compare those results to the sticky sampling implementation, using an error tolerance of .001 (0.1%).

cat Demo*log | grep --text -i GET | awk '{print $11$12}' | /usr/bin/time countish -error-tolerance .001 -impl sticky -threshold .005 2>&1 | awk '$1>.005 {print $0}'
0.006792 rad.msn.com/ADSAdClient31.dll
0.006236 vm.boldcenter.com/aid/5707504118312057803/bc.vm
0.007443 energydata.aws.com//WxDataISAPI/WxDataISAPI.dll
0.006628 energydata.aws.com//ForecastISAPI/ForecastISAPI.dll
3.04user 0.18system 0:07.69elapsed 42%CPU (0avgtext+0avgdata 12044maxresident)k
0inputs+0outputs (1major+3754minor)pagefaults 0swaps

The first thing we notice is that the user time drops from 3.46s to 3.04s, about 12% faster. The memory usage is 12 MB, a more than 20x reduction! The results take a little interpretation. You'll see that energydata.aws.com, rad.msn.com, and vm.boldcenter.com are all still included. You'll also notice that I'm post-processing with awk to extract values whose estimated frequency is > 0.5%: an approximate counter can emit items whose true frequency falls slightly below the threshold (down to the threshold minus the error tolerance), so a final filter on the estimates is useful.

Let's compare to lossy counting. Note: I'm post-processing in the same way, keeping results with an estimated frequency above .005.

cat Demo*log | grep --text -i GET | awk '{print $11$12}' | /usr/bin/time countish -impl lossy -threshold .005 2>&1 | awk '$1>.005 {print $0}'
0.006637 energydata.aws.com//ForecastISAPI/ForecastISAPI.dll
0.006792 rad.msn.com/ADSAdClient31.dll
0.006239 vm.boldcenter.com/aid/5707504118312057803/bc.vm
0.007444 energydata.aws.com//WxDataISAPI/WxDataISAPI.dll
3.97user 0.21system 0:07.80elapsed 53%CPU (0avgtext+0avgdata 7744maxresident)k
0inputs+0outputs (2major+2110minor)pagefaults 0swaps

Lossy counting has a slightly longer runtime, but its 7.7 MB peak is roughly a third less memory than sticky sampling used, and more than a 35x reduction from the exact method, while still returning very accurate results!
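As an aside, if you'd rather embed this in a server than shell out to a CLI, the wiring might look something like the toy below. It reuses the LossyCounter sketch from earlier in this post; the routes and the mutex are my own invention, and this is emphatically not countish's real API (check the repo for that):

// Assumes the LossyCounter sketch from earlier is in scope.
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync"
)

var (
	mu   sync.Mutex // handlers run concurrently; guard the counter
	hits = NewLossyCounter(0.001)
)

func main() {
	http.HandleFunc("/heavy-hitters", func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		defer mu.Unlock()
		for _, url := range hits.Report(0.005) {
			fmt.Fprintln(w, url)
		}
	})
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		hits.Observe(r.URL.Path) // bounded memory, no per-URL counter map
		mu.Unlock()
		fmt.Fprintln(w, "ok")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}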

Conclusions

Empirically, approximate counting seems to be a huge win. Sticky sampling offers greatly reduced memory usage and increased performance on high-cardinality sets without measurably degrading results, and lossy counting squeezes memory down further still. This sort of algorithm is ideal for building a Google Analytics-style realtime experience, or for integrating into your favorite multitenant stream processor to report on failing URLs.

