Counting Characters

June 9, 2013

A few days ago I read a STL Lambda Lounge post about Counting DNA Nucleotides. After solving the Nucleotides problem in Objective-C, I decided to solve a slightly different problem. I decided to count and print the number of times each character (assume ASCII text) appears in Moby Dick. I used the same data structure to solve both problems: an array of unsigned int values, where the array index represents the ASCII character and the value is the count.

Reading Moby Dick using Grand Central Dispatch

// This function blocks until all characters are counted.
void CBRCountCharactersInFileAtPath(NSString *path, unsigned int *characters)
{
    // The dispatch_io_read is an asynchronous call. We use a semaphore to provide synthetic synchronization.
    dispatch_semaphore_t semaphore = dispatch_semaphore_create(0);

    dispatch_queue_t readQueue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_io_t channel = dispatch_io_create_with_path(DISPATCH_IO_STREAM, [path UTF8String], 0, O_RDONLY, readQueue, NULL);

    // SIZE_MAX means keep going until we reach EOF.
    dispatch_io_read(channel, 0, SIZE_MAX, readQueue, ^(bool done, dispatch_data_t data, int error) {
        if (done) {
            // All bytes have been read. Let's notify that we are done.
            dispatch_semaphore_signal(semaphore);
        } else if (data != NULL) {

            // As we read, we count.
            NSString *string = CBRStringFromDispatchData(data);
            CBRCountCharactersInString(string, characters);
        }
    });

    // Block the calling queue/ thread until the IO read completes (i.e. the channel is "done")
    dispatch_semaphore_wait(semaphore, DISPATCH_TIME_FOREVER);

    // What we open, we must close.
    dispatch_io_close(channel, 0);
}

The above code sets up and executes streaming file IO using Grand Central Dispatch. Now let's look at the two functions named CBRStringFromDispatchData and CBRCountCharactersInString.

Here is the code used to create a NSString from a dispatch_data_t.

NSString *CBRStringFromDispatchData(dispatch_data_t data)
{
    size_t dataSize = dispatch_data_get_size(data);
    NSMutableString *string = [[NSMutableString alloc] initWithCapacity:dataSize];
    dispatch_data_apply(data, ^bool(dispatch_data_t region, size_t offset, const void *buffer, size_t size) {
        [string appendFormat:@"%.*s", (unsigned int)size, buffer];
        return true;
    });
    return string;
}

Here is the code that tracks the character counts.

void CBRCountCharactersInString(NSString *string, unsigned int *characters)
{
    for (unsigned int index = 0; index < [string length]; index++) {
        characters[[string characterAtIndex:index]]++;
    }
}

Sample Program

#import "CBRCounter.h"

int main(int argc, const char * argv[])
{
    @autoreleasepool {
        unsigned int characters[256] = {0};
        CBRCountCharactersInFileAtPath(@"/path/to/mobydick.txt", characters);
        CBRPrintCharacterCount(characters);
    }
    return 0;
}

Printing the Character Counts

void CBRPrintCharacterCount(unsigned int *characters)
{
    // This example only prints the "printable" characters.
    for (unsigned int character = ' '; character <= '~'; character++) {
        NSLog(@"%c: %d", (char)character, characters[character]);
    }
}

Example Output

A: 2725
B: 1466
C: 1184
D: 802
E: 1361
F: 864
G: 751
H: 1497
I: 3641
J: 261
K: 185
L: 959
M: 782
N: 1241
O: 1050
P: 1158
Q: 323
R: 895
S: 2277
T: 2568
U: 284
V: 181
W: 1326
X: 25
Y: 360
Z: 38

Verify the Counts

I verified my program's results using TextEdit.

  1. Open "Moby Dick" in TextEdit
  2. Type Command+F
  3. Deselect "Ignore Case"
  4. Select "Contains"
  5. Enter Character
  6. View TextEdit's Result