Set operations in unix shell (2008)



A while ago I wrote about how I solved the Google Treasure Hunt Puzzle Nr. 4 about prime numbers. I took an unusual approach and solved this problem entirely from the Unix shell. The solution involved finding the intersection between a bunch of files containing numbers. This led me to the idea of writing a post about how to do various set operations from the shell by using common utilities such as sort, uniq, diff, grep, head, tail, comm, and others.

I'll cover the following set operations in this article:

Set Membership. Test if an element belongs to a set.
Set Equality. Test if two sets contain the same elements.
Set Cardinality. Return the number of elements in the set.
Subset Test. Test if a given set is a subset of another set.
Set Union. Find union of two sets.
Set Intersection. Find intersection of two sets.
Set Complement. Given two sets A and B, find all elements in A that are not in B.
Set Symmetric Difference. Find symmetric difference of two sets.
Power Set. Generate all subsets of a set.
Set Cartesian Product. Find A x B.
Disjoint Set Test. Test if two sets are disjoint.
Empty Set Test. Test if a given set is empty.
Minimum. Find the smallest element of a set.
Maximum. Find the largest element of a set.

Update: I wrote another post about these operations and created a cheat sheet.

Download cheat sheet: set operations in unix shell (.txt)

To illustrate these operations, I created a few random sets to work with. Each set is represented as a file with one element per line. The elements are positive numbers.

First I created two sets A and B with 5 elements each so that I could easily check that the operations really work.

Sets A and B are hand crafted. It's easy to see that only elements 1, 2 and 3 are in common:

$ cat A     $ cat B
3           11
5           1
1           12
2           3
4           2

I also created a set Asub which is a subset of set A and Anotsub which is not a subset of A (to test the Subset Test operation):

$ cat Asub     $ cat Anotsub
3              6
2              7
5              8

Next I created two equal sets Aequal and Bequal again with 5 elements each:

$ cat Aequal     $ cat Bequal
103              100
102              101
101              102
104              103
100              104

Then I created two huge sets Abig and Bbig with 100,000 elements (some of them are repeated, but that's ok).

The easiest way to generate sets Abig and Bbig is to take natural numbers from /dev/urandom. There are two shell commands that can easily do that. The first is "od" and the second is "hexdump".

Here is how to create two files with 100,000 natural numbers with both commands.

With hexdump:

$ hexdump -e '1/4 "%u\n"' -n400000 /dev/urandom > Abig
$ hexdump -e '1/4 "%u\n"' -n400000 /dev/urandom > Bbig

The "-e" switch specifies a hand-crafted output format. It says take 1 element of size 4 bytes and output it as an unsigned integer. The "-n" switch specifies how many bytes to read, in this case 400000 (400000 bytes / 4 bytes per element = 100000 elements).

With od:

$ od -An -w4 -tu4 -N400000 /dev/urandom | sed 's/ *//' > Abig
$ od -An -w4 -tu4 -N400000 /dev/urandom | sed 's/ *//' > Bbig

The "-An" switch suppresses the line addresses. The "-w4" switch sets the number of bytes to output per line to 4. The "-tu4" switch says to output unsigned 4-byte numbers, and "-N400000" limits the input to 400000 bytes (400000/4 = 100000 elements). The output from od has to be filtered through sed to drop the leading whitespace characters.
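The od recipe above can be wrapped in a small helper, sketched here under a few assumptions: the name gen_set is made up for illustration, GNU od is assumed (the -w option is a GNU extension), and tr is swapped in for sed to strip the padding. The added -v flag keeps od from collapsing consecutive identical lines into a "*", which would otherwise change the line count in the unlikely event of a repeat.

```shell
#!/usr/bin/env bash
# gen_set FILE COUNT (hypothetical helper, not from the original post)
# Write COUNT random unsigned 32-bit integers to FILE, one per line.
# Assumes GNU od, because -w is a GNU extension.
gen_set() {
    local file=$1 count=$2
    # -v prints every line, so od never abbreviates repeated lines as "*";
    # tr removes the padding spaces that od places around each number.
    od -An -v -w4 -tu4 -N"$((count * 4))" /dev/urandom | tr -d ' ' > "$file"
}

# Example: build a small set in a temporary file and count its lines.
tmp_set=$(mktemp)
gen_set "$tmp_set" 1000
wc -l < "$tmp_set"
rm -f "$tmp_set"
```

Calling gen_set Abig 100000 and gen_set Bbig 100000 would produce files of the same shape as the ones used in this article.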

Okay, now let's look at various set operations.

Set Membership

The set membership operation tests if an element belongs to a set. We write a ∈ A if element a belongs to set A, and we write a ∉ A if it does not.

The easiest way to test if an element is in a set is to use the "grep" command. Grep searches the file for lines matching a pattern:

$ grep -xc 'element' set

The "-c" flag outputs the number of matching lines, that is, how many times the element occurs in the set. If it is not a multi-set, this number should be 0 or 1. The "-x" option specifies to match the whole line only (no partial matches).

Here is an example of this operation run on set A:

$ grep -xc '4' A
1
$ grep -xc '999' A
0

That's correct. Set A contains element 4 but does not contain element 999.

If the membership operation has to be used from a shell script, the return code from grep can be used instead. Unix commands succeed if the return code is 0, and fail otherwise:

$ grep -xq 'element' set
# returns 0 if element ∈ set
# returns 1 if element ∉ set

The "-q" flag makes sure that grep does not output the element if it is in the set.
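As a minimal sketch of how the return code might be used in a script, the function below wraps the grep call. The name is_member is hypothetical, and the -F flag is an addition not used above: it makes grep treat the element as a literal string, so an element containing characters such as "." or "*" is not interpreted as a regular expression.

```shell
#!/usr/bin/env bash
# is_member ELEMENT FILE (hypothetical helper, not from the original post)
# Exit 0 if ELEMENT is a line of FILE, non-zero otherwise.
is_member() {
    # -F: literal string, -x: whole-line match, -q: no output.
    # "--" stops option parsing in case ELEMENT starts with a dash.
    grep -Fxq -- "$1" "$2"
}

# Demo on a throwaway copy of set A from the article.
set_a=$(mktemp)
printf '%s\n' 3 5 1 2 4 > "$set_a"

for element in 4 999; do
    if is_member "$element" "$set_a"; then
        echo "$element is in A"
    else
        echo "$element is not in A"
    fi
done

rm -f "$set_a"
```

The demo prints "4 is in A" followed by "999 is not in A".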

Set Equality

The set equality operation tests if two sets are the same, i.e., contain the same elements. We write A = B if sets A and B are equal and A ≠ B if they are not.

The easiest way to test if two sets are equal is to use the "diff" command. Diff compares two files line by line, so it would report a difference whenever the order of lines differs. The files therefore have to be sorted first. If they are multi-sets, the output of sort has to be run through the "uniq" command to eliminate duplicate elements:

$ diff -q <(sort set1 | uniq) <(sort set2 | uniq)
# returns 0 if set1 = set2
# returns 1 if set1 ≠ set2

The "-q" flag makes diff report only whether the files differ, without listing the differences themselves.

Let's test this operation on sets A, B, Aequal and Bequal:

$ diff -q <(sort A | uniq) <(sort B | uniq)
# return code 1 -- sets A and B are not equal

$ diff -q <(sort Aequal | uniq) <(sort Bequal | uniq)
# return code 0 -- sets Aequal and Bequal are equal
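The equality test can be packaged the same way for scripts. This is only a sketch: set_equal is a made-up name, bash is assumed because of the process substitution, and "sort -u" is used as a shorthand for "sort | uniq".

```shell
#!/usr/bin/env bash
# set_equal FILE1 FILE2 (hypothetical helper, not from the original post)
# Exit 0 if both files hold the same set of lines, non-zero otherwise.
set_equal() {
    # sort -u sorts and drops duplicates in one step; the one-line
    # "Files ... differ" message from diff -q is discarded.
    diff -q <(sort -u "$1") <(sort -u "$2") > /dev/null
}

# Demo on throwaway copies of Aequal and Bequal from the article.
dir=$(mktemp -d)
printf '%s\n' 103 102 101 104 100 > "$dir/Aequal"
printf '%s\n' 100 101 102 103 104 > "$dir/Bequal"

if set_equal "$dir/Aequal" "$dir/Bequal"; then
    echo "equal"
else
    echo "not equal"
fi

rm -r "$dir"
```

The demo prints "equal", since both files contain the numbers 100 through 104.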

If you have already sorted sets, then just run:

$ diff -q set1 set2

Set Cardinality

The set cardinality operation returns the number of elements in the set. We write |A| to denote the cardinality of the set A.

The simplest way to count the number of elements in a set is to use the "wc" command. Wc counts the number of characters, words or lines in a file. Since each element in the set appears on a new line, counting the number of lines in the file will return the cardinality of the set:

$ wc -l set | cut -d' ' -f1

The cut command is necessary because "wc -l" also outputs the name of the file it was run on. The cut command o
