I'm trying to come up right now with an example of a common device that doesn't use modern multi-core processors... uhm, phones... no... tablets... no... iPod touch... no... smartwatches... no -at least not the big-league ones; Fitbit ones do, as does any iPod nano...
I think you get the idea though: if it looks toylike, it has a single-core processor; otherwise, it's multi-core.
A different way of taking advantage of multiple cores is through the use of threads, e.g., pthreads. In Linux there is essentially no difference under the hood -see the reference by Linus Torvalds further down; in Windows there is no fork, only threads.
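Just for contrast -we won't use threads in the rest of this post- here is a minimal pthreads sketch of the same idea (helloworld-thr.c is just a name I made up for this illustration):

//helloworld-thr.c -- sketch: the threaded cousin of the fork examples below
#include <stdio.h>
#include <pthread.h>

void* say_hello(void* arg){
    printf("Hello World! I'm thread number %ld\n",(long)arg);
    return NULL;
}

int main(){
    pthread_t t[4];
    for( long i = 0 ; i < 4 ; i++)
        pthread_create(&t[i],NULL,say_hello,(void*)i); //spawn 4 threads
    for( int i = 0 ; i < 4 ; i++)
        pthread_join(t[i],NULL);                       //wait for each one
    return 0;
}

Compile it with gcc helloworld-thr.c -lpthread. Now back to processes.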
The Multi-Process / Parallel / Concurrent "Hello World!"
Let's see an example in C:
//helloworld-mp.c
#include <stdio.h>  //printf
#include <unistd.h> //fork
#include <stdlib.h> //exit

int main(){
    pid_t pid = fork();
    switch( pid ) {
        case -1:
            printf("ERROR: spawning of child process failed\n");
            return 1;
        case 0:
            printf("Hello World! I'm the child whose process id pid=%d... Uhm...odd, yeah...\n"\
                   "anyway, my parents know my pid for sure. Ask them!\n",pid);
            exit(0);
        default:
            printf("Hi, I'm the parent of the process with pid=%d\n",pid);
            break;
    }
    return 0;
}
Compile as gcc helloworld-mp.c and run as ./a.out. The output should look like this:
Hi, I'm the parent of the process whose pid=11518
Hello World! I'm the child with process id pid=0... Uhm...odd, yeah...
anyway, my parents know my pid for sure. Ask them!
What just happened?
The call to fork basically generates a copy of the whole program and its state up to that point -except for file descriptors, which are shared, and the value of the pid variable, which differs:
- During the program execution:
  - the child -any child- sees a value of pid == 0.
  - the parent process sees a pid value that varies from child to child. That pid value seen by the parent is the actual PID the kernel assigned to the child process.
- The memory pages of our program are likely duplicated. I say likely because the method used is copy-on-write (COW). Basically this works as follows:
  - initially nothing is duplicated at all (except details like the value of the pid variable, say). The child process has access to any variable just as any normal code of ours would.
  - however, if and when the child tries to modify a common variable, the kernel copies the memory page holding that variable into the child's own pages.
- The children are assigned a process group ID (PGRP) equal to the PID of the parent.
As the virtual pages of different processes are independent, by default the modifications done by the child process cannot be seen by the parent, and vice versa.
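Here is a minimal sketch of that last point (cow-demo.c is just a name I picked): the child writes to a global variable, which triggers the copy of the affected page, and the parent never sees the change.

//cow-demo.c -- sketch: a child's writes stay invisible to the parent
#include <stdio.h>
#include <stdlib.h>   //exit
#include <unistd.h>   //fork
#include <sys/wait.h> //wait

int x = 42;

int main(){
    pid_t pid = fork();
    if( pid == 0 ){
        x = 1000; //write: the kernel now copies the page holding x for the child
        printf("child : x=%d\n",x);             //prints 1000
        exit(0);
    }
    wait(NULL); //make sure the child is done before we look
    printf("parent: x=%d (unchanged)\n",x);     //still 42
    return 0;
}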
You can find more details in this excellent Stack Overflow thread, or in this comment by Linus Torvalds on the way Linux deals with processes and threads.
Non-determinism
In concurrent programming, execution is in general non-deterministic. This means we cannot anticipate the order in which the different child processes will be scheduled and, hence, finish their tasks. Thus, you may have seen the above output with the child's greeting first. That gets more likely the more children your program spawns and the busier your computer is running apps in the background -browsers with several tabs, a mail program, iTunes...
This fact is the source of additional headaches when debugging a concurrent program.
Multiple Children
Just do that same fork as many times as you need. Here we will use a loop.
#include <stdio.h>  //printf
#include <unistd.h> //fork
#include <stdlib.h> //exit

#define NUMBER_CHILDREN 8

int main(){
    for ( int i= 0 ; i < NUMBER_CHILDREN ; i++){
        pid_t pid = fork();
        switch( pid ) {
            case -1:
                printf("ERROR: spawning of child process failed\n");
                return 1;
            case 0:
                printf("Hello World! I'm the child number %d. Bye!\n",i+1);
                exit(0);
            default:
                printf("Hi, I'm the parent of the process with pid=%d\n",pid);
                break;
        }
    }
    return 0;
}
Compile and run. You should see an output similar to this:
Hi, I'm the parent of the process with pid=11607
Hi, I'm the parent of the process with pid=11608
Hello World! I'm the child number 1. Bye!
Hello World! I'm the child number 2. Bye!
Hi, I'm the parent of the process with pid=11609
Hello World! I'm the child number 3. Bye!
Hi, I'm the parent of the process with pid=11610
Hello World! I'm the child number 4. Bye!
Hi, I'm the parent of the process with pid=11611
Hello World! I'm the child number 5. Bye!
Hi, I'm the parent of the process with pid=11612
Hello World! I'm the child number 6. Bye!
Hi, I'm the parent of the process with pid=11613
Hello World! I'm the child number 7. Bye!
Hi, I'm the parent of the process with pid=11614
Hello World! I'm the child number 8. Bye!
Run it several times. You'll see what I meant by non-determinism: The order in which those print statements are executed varies from run to run.
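By the way, if you ever need a fixed ordering -say, while chasing a bug- you can serialize the loop by reaping each child right after forking it, using waitpid (properly introduced in the next section). You give up all the concurrency, of course. A minimal sketch:

//ordered.c -- sketch: forcing a deterministic order by waiting on each child
#include <stdio.h>
#include <stdlib.h>   //exit
#include <unistd.h>   //fork
#include <sys/wait.h> //waitpid

#define NUMBER_CHILDREN 8

int main(){
    for ( int i = 0 ; i < NUMBER_CHILDREN ; i++){
        pid_t pid = fork();
        if( pid == 0 ){
            printf("Hello World! I'm the child number %d. Bye!\n",i+1);
            exit(0);
        }
        waitpid(pid,NULL,0); //block until this child ends: the order is now fixed
    }
    return 0;
}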
Chaperoning your children
Any decent parent will want to look after their children. I know you do. The child's PID allows the parent some level of targeted interaction, like...well, killing it. 8-/
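A minimal sketch of that grim interaction (killchild.c is a name I made up):

//killchild.c -- sketch: a parent terminating a child by its PID
#include <stdio.h>
#include <signal.h>   //kill
#include <unistd.h>   //fork, pause, sleep
#include <sys/wait.h> //waitpid

int main(){
    pid_t pid = fork();
    if( pid == 0 ) pause(); //child: sleep until a signal arrives
    if( pid < 0 ) return 1; //fork failed; bail out (kill(-1,...) would be bad news)
    sleep(1);               //parent: give the child time to settle
    kill(pid,SIGTERM);      //targeted interaction: terminate that very child
    waitpid(pid,NULL,0);    //reap it so it doesn't linger as a zombie
    printf("Child %d terminated\n",pid);
    return 0;
}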
Another thing that parents often do, is wait for their children to come back after they rushed to the playground and played for a while.
We can do this the following way:
#include <stdio.h>    //printf
#include <unistd.h>   //fork
#include <stdlib.h>   //exit
#include <sys/wait.h> //wait

#define NUMBER_CHILDREN 8

int main(){
    pid_t pid;
    for ( int i= 0 ; i < NUMBER_CHILDREN ; i++){
        pid = fork();
        switch( pid ) {
            case -1:
                printf("ERROR: spawning of child process failed\n");
                return 1;
            case 0:
                printf("Hello World! I'm the child number %d. Bye!\n",i+1);
                exit(0);
            default:
                printf("Hi, I'm the parent of the process with pid=%d\n",pid);
                break;
        }
    }
    printf("I'm the parent waiting for all my %d offspring to come back and report\n",NUMBER_CHILDREN);
    int status;
    while( (pid=wait(&status)) > 0) {
        printf("My child with pid=%d is back!\n",pid);
    }
    printf("All children returned\n");
    return 0;
}
Compile as usual and run it. You should see something like this:
Hi, I'm the parent of the process with pid=11685
Hi, I'm the parent of the process with pid=11686
Hello World! I'm the child number 1. Bye!
Hello World! I'm the child number 2. Bye!
Hi, I'm the parent of the process with pid=11687
Hi, I'm the parent of the process with pid=11688
Hello World! I'm the child number 3. Bye!
Hello World! I'm the child number 4. Bye!
Hi, I'm the parent of the process with pid=11689
Hi, I'm the parent of the process with pid=11690
Hello World! I'm the child number 5. Bye!
Hello World! I'm the child number 6. Bye!
Hi, I'm the parent of the process with pid=11691
Hi, I'm the parent of the process with pid=11692
I'm the parent waiting for all my 8 offspring to come back and report
Hello World! I'm the child number 7. Bye!
My child with pid=11690 is back!
My child with pid=11689 is back!
My child with pid=11688 is back!
My child with pid=11687 is back!
My child with pid=11686 is back!
My child with pid=11685 is back!
Hello World! I'm the child number 8. Bye!
My child with pid=11691 is back!
My child with pid=11692 is back!
All children returned
The function wait(int*) here is a wrapper around wait4(pid_t,int*,int, struct rusage*), equivalent to wait4(-1,&status,0,NULL). See man 2 wait.
wait(int*) basically waits -that is, blocks execution!- until any of the children terminates -or is killed- and then returns the child's PID, or -1 if there are no more children.
Actually, it returns -1 as well if there was a problem; you can catch that error value and double-check (via errno) what happened. I'm trying to keep things as simple as possible to get you started though.
Therefore, by looping while wait returns a positive value, we are effectively waiting until each and every child has ended its execution.
The variable status contains information on how the child process terminated its execution -did it really finish as expected? Was it terminated because it received a (e.g. kill) signal? etc.
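If blocking is not acceptable, waitpid accepts a WNOHANG option that makes it return 0 immediately while children are still running. A minimal polling sketch:

//waitnohang.c -- sketch: checking on children without blocking
#include <stdio.h>
#include <stdlib.h>   //exit
#include <unistd.h>   //fork, sleep
#include <sys/wait.h> //waitpid

int main(){
    if( fork() == 0 ){ sleep(2); exit(0); } //a child that takes a while
    int status;
    pid_t pid;
    while( (pid=waitpid(-1,&status,WNOHANG)) == 0 ){
        printf("Child still busy; parent doing something else meanwhile...\n");
        sleep(1); //any useful work would go here
    }
    if( pid > 0 ) printf("My child with pid=%d is back!\n",pid);
    return 0;
}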
What have you done out there, my child?
The following code, waitforallchild.c, provides a more detailed example of what information can be gathered about a child's execution.
It has the child calculating a large Fibonacci number. The code for this is in the header fibo.h listed afterwards:
//waitforallchild.c
#include <stdio.h>
#include <sys/wait.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#ifdef FIBO
#include "fibo.h"
#ifndef FIBON
#define FIBON (fibo_t) 5*1000*1000*1000
#endif
#define childwork() printf("fib(%llu)=%llu\n",FIBON,fib(FIBON));
#define BUSY "computation lasted"
#else
#define childwork() nanosleep(&timeout,NULL);
#define BUSY "slept for"
#endif

#define NUMBER_CORES 8
#define CHILD_SLEEP 10*NUMBER_CORES
#define MAXCHILDREN 100

int main(int argc, const char** argv){
    time_t begin = time(NULL);
    int nch = MAXCHILDREN/25;
    if( argc > 1 && strncmp(argv[1],"-",1)==0 ){
        printf("Usage: %s [number_of_children (%d)]\n\n"\
               "Spawns number_of_children and waits for them to finish before wrapping up.\n"\
               "Max number allowed is %d. If number_of_children is larger, it is automatically reset to that max value.\n",\
               argv[0], nch, MAXCHILDREN\
        );
        exit(0);
    }
    if (argc > 1){
        nch = atoi(argv[1]);
        nch = (nch<0||nch>MAXCHILDREN)? MAXCHILDREN*(nch>0?10:1)/10 : nch ;
    }
    pid_t* pids = (pid_t*) calloc(nch,sizeof(pid_t)); //(reserved for the children's PIDs; unused in this version)
    struct timespec timeout;
    timeout.tv_sec = CHILD_SLEEP/nch;
    timeout.tv_nsec = (long)(((CHILD_SLEEP%nch)*1000000000LL)/nch); //fractional remainder of the split
    printf("Will try spawning %d children each living for %4.f sec\n",nch,timeout.tv_sec+1e-9*timeout.tv_nsec);
    pid_t pid , sumOfpids = 0;
    int i , n=0, status;
    for( i= 0 ; i < nch ; i++){
        pid = fork();
        n++;
        switch(pid){
            case -1:
                printf("WARNING: Error while forking a child. Will try to continue with remaining %d children.\n",nch-n);
                n--;
                continue;
            case 0:
                begin = time(NULL);
                printf("Hi, I'm child %d. Going to sleep now\n",n);
                childwork();
                printf("Child %d %s %ld sec. Bye\n",n,BUSY,(long)(time(NULL)-begin));
                exit(0);
            default:
                sumOfpids += pid;
                printf("Parent: saw child %d (%d) parting\n",n,pid);
                break;
        }
    }
    printf("All children submitted. Waiting for their termination...\n");
    /*
    //all these are equivalent ways of waiting FOR ALL children -caveat: if a child stops (without terminating) they return a >0 value too!
    //while( waitpid(0,&status,0)>0 );
    //while( wait4(-1,&status,0,0)>0 );
    //while( wait3(&status,0,0)>0 );
    //while( wait(&status)>0 );
    while( sumOfpids > 0 ){
        pid = wait(&status);
        if( pid == -1 ){
            printf("Error in child or while waiting for one\n");
            exit(1);
        }
        sumOfpids -= pid ;
        printf("Child %d finished\n",pid);
    }
    */
    //wait and check exit status
    int options = WUNTRACED;
    while( (pid=wait3(&status,options,0))>0){
        int wexstat;
        if ( WIFEXITED(status) ){
            wexstat=WEXITSTATUS(status);
            printf("Child %d terminated with an exit call %d\n",pid,wexstat);
        }
        if ( WIFSIGNALED(status)) {
            wexstat=WTERMSIG(status);
            printf("Child %d terminated with a term signal %d\n",pid,wexstat);
            if( WCOREDUMP(status) ) printf("Coredump was created\n");
        }
        if ( WIFSTOPPED(status)) {
            wexstat=WSTOPSIG(status);
            printf("Child %d stopped with signal %d\n",pid,wexstat);
        }
    }
    printf("All %d children finished. Elapsed time approx. %ld sec.\n",n,(long)(time(NULL)-begin));
    return 0;
}
The fibo.h header :
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#ifndef DEBUG
#define DEBUG 0
#endif

typedef long long unsigned int fibo_t;

fibo_t fib(fibo_t n){
    fibo_t f, tmp1, tmp2;
    switch(n){
        case 0:
            return 0;
        case 1:
        case 2:
            return 1;
        default:
            tmp2 = (fibo_t) 1;
            tmp1 = (fibo_t) 1;
            for(fibo_t i=2; i < n ; i++){
                f = tmp1 + tmp2;
                if(DEBUG)printf("#DEBUG: fib(%llu)=%llu+%llu=%llu\n",i+1,tmp1, tmp2, f);
                tmp2 = tmp1 ;
                tmp1 = f;
            }
            return f;
    }
    exit(-1); //not reachable; silences compiler warnings about a missing return
}
Compile as gcc -o waitforallchild waitforallchild.c -DFIBO, then run it as ./waitforallchild. You should see something like this:
Will try spawning 4 children each living for 20 sec
Parent: saw child 1 (11829) parting
Parent: saw child 2 (11830) parting
Hi, I'm child 1. Going to sleep now
Parent: saw child 3 (11831) parting
Hi, I'm child 2. Going to sleep now
Parent: saw child 4 (11832) parting
All children submitted. Waiting for their termination...
Hi, I'm child 3. Going to sleep now
Hi, I'm child 4. Going to sleep now
fib(5000000000)=10859268830657044933
Child 1 computation lasted 20 sec. Bye
Child 11829 terminated with an exit call 0
fib(5000000000)=10859268830657044933
Child 2 computation lasted 20 sec. Bye
Child 11830 terminated with an exit call 0
fib(5000000000)=10859268830657044933
Child 3 computation lasted 20 sec. Bye
Child 11831 terminated with an exit call 0
fib(5000000000)=10859268830657044933
Child 4 computation lasted 20 sec. Bye
Child 11832 terminated with an exit call 0
All 4 children finished. Elapsed time approx. 20 sec.
Now open a new terminal window. We will run the program again, but this time kill one of the children while it is still computing its Fibonacci number. In this example I issued kill -9 11837, and the output is:
Parent: saw child 1 (11836) parting
Hi, I'm child 1. Going to sleep now
Parent: saw child 2 (11837) parting
Hi, I'm child 2. Going to sleep now
Parent: saw child 3 (11838) parting
Parent: saw child 4 (11839) parting
All children submitted. Waiting for their termination...
Hi, I'm child 3. Going to sleep now
Hi, I'm child 4. Going to sleep now
Child 11837 terminated with a term signal 9
fib(5000000000)=10859268830657044933
Child 4 computation lasted 20 sec. Bye
Child 11839 terminated with an exit call 0
fib(5000000000)=10859268830657044933
Child 3 computation lasted 20 sec. Bye
Child 11838 terminated with an exit call 0
fib(5000000000)=10859268830657044933
Child 1 computation lasted 20 sec. Bye
Child 11836 terminated with an exit call 0
All 4 children finished. Elapsed time approx. 20 sec.
Matrix Multiplication
Multi-linear functions, a.k.a. matrices, are a domain of linear algebra that can easily benefit from using multiple cores.
One example is matrix multiplication.
Let's denote a matrix A with n rows and m columns as A(n,m).
A matrix A(n,k) can only be multiplied (on the right) with a matrix B(k,m). The resulting matrix will have n rows and m columns. Lets denote it as C(n,m).
Thus C has n*m components. Each of these can be calculated independently as follows: row i, column j of C is given by
C(i,j) = product, element by element, of row i of A times column j of B, adding up all those k results = A(i,1)*B(1,j) + A(i,2)*B(2,j) + ... + A(i,k)*B(k,j)
That is, the determination of each element of C involves 2k-1 operations: k multiplications plus k-1 additions. Hence, in total the matrix multiplication requires n*m*(2k-1) operations.
In particular, calculating the square of a matrix A(n,n) involves n*n*(2n-1) ~ 2n^3 operations.
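As a quick sanity check, take k = n = m = 2 and square the matrix A with rows (1 2) and (3 4): C(1,1) = 1*1 + 2*3 = 7, C(1,2) = 1*2 + 2*4 = 10, C(2,1) = 3*1 + 4*3 = 15, C(2,2) = 3*2 + 4*4 = 22. Each of the 4 entries took 2 multiplications and 1 addition, 12 operations in total, matching n*m*(2k-1) = 2*2*3 = 12.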
The following code example does this multi-threaded, or rather, multi-process matrix multiplication. It relies on the header mtxmult_mp.h shown afterwards.
//test_mtxmult_mp-big.c
#include "mtxmult_mp.h"
#include <math.h>
#include <string.h>

int main(int argc, char** argv){
    size_t nbproc = NUM_PROCESS;
    if( argc < 3){
        printf("Usage: %s nrows ncols [ nbproc (%zu) ] [-ns]\n\n-ns\t\tno single-thread\n",\
               argv[0],nbproc);
        return -1;
    }
    size_t nrows = (size_t) atol(argv[1]);
    size_t ncols = (size_t) atol(argv[2]);
    if( argc > 3 ){
        nbproc = (size_t) atol(argv[3]);
    }
    int skip_single_thread=0;
    if ( argc > 4 && strcmp("-ns",argv[4])==0){
        skip_single_thread=1;
    }
    printf("Max # threads: %zu\n",nbproc);
    printf("Random array size %zux%zu\n",nrows,ncols);
    //allocating the original matrix in shared memory makes no difference here
    //thanks to COW, since the children only read from it:
    //double* A = mmap(NULL,nrows*ncols*sizeof(double),PROT_READ | PROT_WRITE,
    //                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    double *A2, *A = (double*) calloc(nrows*ncols, sizeof(double));
    srand(1234567);
    for(size_t i = 0 ; i < nrows*ncols ; i++){
        A[i] = floor((10.0*rand())/RAND_MAX) ;
    }
    const size_t asz[2]={nrows,ncols};
    pmtx(A,nrows,ncols,"A");
    time_t begin ;
    if( !skip_single_thread ){
        printf("Single thread\n");
        begin = time(NULL);
        A2=mtxsq(A,asz);
        printf("Time: %ld sec.\n",(long)(time(NULL)-begin));
        pmtx(A2,nrows,ncols,"A^2");
    }
    begin = time(NULL);
    A2=mtxsq_thr(A,asz,nbproc);
    printf("Time: %ld sec.\n",(long)(time(NULL)-begin));
    pmtx(A2,nrows,ncols,"A^2");
    return 0;
}
The actual concurrent calculation of the matrix multiplication is done by the code in mtxmult_mp.h that follows. As this example intends to show the difference between concurrent and sequential calculation of the square of a matrix, this code in turn requires the (sequential) version found in the header mtxmult.h that is listed afterwards.
//mtxmult_mp.h
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <unistd.h>
#include <time.h>
#include <sys/mman.h>
#include "mtxmult.h"

#define NUM_CORES 4
#define NUM_PROCESS (size_t)(NUM_CORES-1)
#define DEBUG 0

//Threaded ( via fork() ) version of matrix multiplication
double* mtxm_thr(double* a,const size_t asz[2],double* b,const size_t bsz[2], size_t nbproc){
    printf("Threaded Matrix Multiplication start...%zu threads\n",nbproc);
    if ( asz[1] != bsz[0] ){
        printf("ERROR: can't multiply A*B: columns of A %zu != rows of B %zu\n",asz[1],bsz[0]);
        exit(1);
    }
    //result matrix (asz[0] x bsz[1]) in shared memory
    double* ab = mmap(NULL,asz[0]*bsz[1]*sizeof(double),PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if( ab == MAP_FAILED){
        printf("ERROR: Couldn't allocate shared memory for matrix multiplication\n");
        exit(2);
    }
    size_t i,j,k, mop, mopo, children=0, fchildren=0;
    pid_t pid;
    size_t mpt = (asz[0]*bsz[1])/nbproc;            //result elements per process
    size_t rmm = (size_t) ((asz[0]*bsz[1])%nbproc); //remainder, handled by the parent
    if( mpt == 0){
        nbproc=asz[0]*bsz[1];
        mpt = 1;
        rmm = 0;
        printf("I: Too large a number of processes. Reset to %zu\n",nbproc);
    }
    size_t* fchildren_ptr = &fchildren;
    mopo=rmm;
    int pd[2];
    pipe(pd);
    while ( children < nbproc - 1 ) {
        children++;
        mopo += mpt;
        pid = fork();
        if ( !pid ){
            //child: compute its share of result elements, mop in [mopo, mopo+mpt)
            for(mop= mopo ; mop < (mopo+mpt) ; mop++){
                i = (size_t) (mop/bsz[1]);
                j = (mop%bsz[1]);
                for(k=0; k<asz[1] ; k++){
                    ab[i*bsz[1]+j] += a[i*asz[1]+k] * b[k*bsz[1]+j] ;
                }
                if( DEBUG ) printf("#DEBUG: child %zu : ab(%zu,%zu)=%.4f\n",children,i,j,ab[i*bsz[1]+j]);
            }
            (*fchildren_ptr)++; //each child's private copy was 0, so every child reports a 1
            if( write(pd[1], fchildren_ptr, sizeof(size_t)) == -1 ){
                printf("ERROR: child %zu couldn't write to parent\n",children);
                exit(3);
            }
            exit(0);
        }
    }
    //parent: takes the remainder plus one regular share
    if( DEBUG ) printf("#DEBUG: Parent processing... mop[0, %zu)\n",rmm+mpt);
    for(mop=0 ; mop < rmm+mpt ; mop++){
        i = (size_t) (mop/bsz[1]);
        j = (mop%bsz[1]);
        for(k=0; k<asz[1] ; k++)
            ab[i*bsz[1]+j] += a[i*asz[1]+k] * b[k*bsz[1]+j] ;
        if( DEBUG ) printf("#DEBUG: parent : ab(%zu,%zu)=%.4f\n",i,j,ab[i*bsz[1]+j]);
    }
    struct timespec timeout;
    timeout.tv_sec = 1;
    timeout.tv_nsec = 10*1000;
    size_t report;
    while( fchildren< nbproc-1 ){
        if( DEBUG ) printf("#DEBUG: checking... %zu(%zu) children finished\n",fchildren,nbproc-1);
        nanosleep(&timeout,NULL);
        if( read(pd[0],&report,sizeof(size_t)) == -1 ){
            printf("ERROR: parent : reading pipe \n");
            exit(4);
        }
        fchildren += report; //each report is a 1
    }
    if( DEBUG ) printf("#DEBUG: %zu(%zu) children finished\n",fchildren,nbproc-1);
    return ab;
}

double* mtxsq_thr(double* m, const size_t msz[2], size_t nbproc){
    return mtxm_thr(m,msz,m,msz,nbproc);
}
And finally, here is the sequential matrix multiplication code of mtxmult.h:
//mtxmult.h
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>

//print a matrix, truncating rows and columns past the 11th
void pmtx(double* m, size_t rs, size_t cs,char* txt){
    if( rs == 0 || cs == 0 ) {
        fprintf(stderr,"Warning: ill-defined matrix dimensions\n");
        return;
    }
    if( txt ) printf("%s\n",txt);
    size_t i,j ;
    for(i=0 ; i<rs ; i++){
        for(j=0 ; j<cs ; j++){
            printf("%.4f\t",m[i*cs+j]);
            if( j == 10) {
                j = cs;
                printf("...");
            }
        }
        printf("\n");
        if( i == 10){
            i = rs;
            printf("...\n");
        }
    }
}

double* mtxm(double* a,const size_t asz[2],double* b,const size_t bsz[2]){
    if ( asz[1] != bsz[0] ){
        printf("ERROR: can't multiply A*B: columns of A %zu != rows of B %zu\n",asz[1],bsz[0]);
        exit(1);
    }
    double* ab = calloc(asz[0]*bsz[1],sizeof(double));
    if( !ab ){
        printf("ERROR: Couldn't allocate memory for matrix multiplication\n");
        exit(2);
    }
    size_t i,j,k;
    for(i=0 ; i<asz[0] ; i++){
        for(j=0 ; j<bsz[1] ; j++)
            for(k=0; k<asz[1] ; k++)
                ab[i*bsz[1]+j] += a[i*asz[1]+k] * b[k*bsz[1]+j] ;
    }
    return ab;
}

double* mtxsq(double* m, const size_t msz[2]){
    return mtxm(m,msz,m,msz);
}
On a Mac, compile this simply as gcc -o test_mtxmult_mp-big test_mtxmult_mp-big.c.
On Linux you'll likely have to point the linker to the math library with -lm, thus compile it as gcc -o test_mtxmult_mp-big test_mtxmult_mp-big.c -lm.
Do a first test by running ./test_mtxmult_mp-big 3 3. You should get:
Max # threads: 3
Random array size 3x3
A
6.0000 2.0000 3.0000
3.0000 8.0000 7.0000
7.0000 9.0000 9.0000
Single thread
Time: 0 sec.
A^2
63.0000 55.0000 59.0000
91.0000 133.0000 128.0000
132.0000 167.0000 165.0000
Threaded Matrix Multiplication start...3 threads
Time: 2 sec.
A^2
63.0000 55.0000 59.0000
91.0000 133.0000 128.0000
132.0000 167.0000 165.0000
You can see that both calculations of the square of matrix A coincide. That's what we want to see.
But you may also notice that the multi-process calculation takes slightly longer to complete!
There is an overhead the kernel incurs in setting up everything for us to use multiple processes. For small matrices that's actually a noticeable burden and a single process fares better.
When is a matrix big enough to justify the use of concurrent programming?
The above program test_mtxmult_mp-big creates a random square matrix of the size we specify on the command line and calculates its square both sequentially and concurrently, with as many child processes as specified. Notice that the parent also takes up its share of work. Thus using 7 children actually means splitting the work among 8 concurrent processes!
Surprisingly, it takes quite a large matrix to start seeing a benefit. The actual size will depend on your machine's specs.
For this first result I'm using a MacBook Pro with the following specs:
$system_profiler SPHardwareDataType
Model Name: MacBook Pro
Model Identifier: MacBookPro9,1
Processor Name: Quad-Core Intel Core i7
Processor Speed: 2.3 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB
Boot ROM Version: 233.0.0.0.0
SMC Version (system): 2.1f173
Serial Number (system): CH348DJ39J4
Hardware UUID: 723078B1-962D-3749-8B77-1D1D857678B1
Sudden Motion Sensor:
State: Enabled
$sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
This table summarizes the results
Size | Single thread | MP 8 threads
1000 | 11 sec | 10 sec
3000 | 400 sec | 107 sec
Thus up to a 1000*1000 matrix it isn't worth bothering with multi-processing. That's roughly 2 billion operations where a single core competes well against 8. Impressive.
For a matrix of size 3000*3000 the difference is however clear: MP using 8 processes has a clear advantage by being roughly 4 times faster!
For this latter case, the following is a snapshot of resource usage as seen by top:
Processes: 455 total, 10 running, 445 sleeping, 1897 threads 01:55:23
Load Avg: 5.47, 2.85, 2.42 CPU usage: 97.98% user, 1.88% sys, 0.12% idle
SharedLibs: 315M resident, 53M data, 38M linkedit.
MemRegions: 103199 total, 3978M resident, 128M private, 1540M shared.
PhysMem: 10G used (2072M wired), 5932M unused.
VM: 2610G vsize, 1994M framework vsize, 25690086(0) swapins, 26376915(0) swapouts.
Networks: packets: 10311058/10G in, 10417725/8575M out.
Disks: 5873635/201G read, 3603239/189G written.
PID COMMAND %CPU TIME #TH #WQ #PORT MEM PURG CMPRS PPID PGRP STATE
12061 test_mtxmult 98.4 00:04.28 1/1 0 8 69M+ 0B 0B 12056 12056 running
12058 test_mtxmult 97.5 00:04.28 1/1 0 8 69M+ 0B 0B 12056 12056 running
12056 test_mtxmult 97.1 00:04.35 1/1 0 10 69M+ 0B 0B 490 12056 running
12057 test_mtxmult 96.5 00:04.24 1/1 0 8 69M+ 0B 0B 12056 12056 running
12063 test_mtxmult 96.4 00:04.25 1/1 0 8 69M+ 0B 0B 12056 12056 running
12060 test_mtxmult 96.4 00:04.28 1/1 0 8 69M+ 0B 0B 12056 12056 running
12059 test_mtxmult 96.4 00:04.25 1/1 0 8 69M+ 0B 0B 12056 12056 running
12062 test_mtxmult 96.3 00:04.25 1/1 0 8 69M+ 0B 0B 12056 12056 running
Notice what we mentioned above:
- The parent process has PID 12056 and a PPID of 490.
- All children have a process group ID (PGRP) equal to the PID of the parent, i.e., 12056.
We store the original matrix of 9*10^6 doubles (8 bytes each), that is, about 69MiB; top indeed reports 69M.
1T (trillion) Operations
In another box I have a Linux system with the following architecture (as reported by lscpu):
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 36 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 58
Model name: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
Stepping: 9
CPU MHz: 1596.518
CPU max MHz: 3900.0000
CPU min MHz: 1600.0000
BogoMIPS: 6787.40
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 8 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, STIBP disabled, RSB filling
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
Calculations will be faster here simply due to the 1.5 times faster CPU clock.
Size | Single thread | MP 8 threads
1000 | 10 sec | 9 sec
3000 | 341 sec | 84 sec
10000| 15972 sec | 7101 sec = 118min (w/o single-proc calc: 7040 sec = 117.3 min)
15972 sec = 4.44 hrs, or about 4h 26min! More than double the time required by our multi-process version!
I didn't want to leave my laptop running 4+ hours; that's why I didn't provide the timing for the 10,000*10,000 case on the MacBook. That's 100M elements requiring about 2*10^12 operations in total -on the order of 1T (a trillion, 1e12) multiplications!
As the elements are stored as doubles (8 bytes each), such a matrix requires on the order of 10^8*8 bytes = 800MB = 763MiB. From the previous results we'd expect the single process to use up to ~1.5GiB and take ~10,000 sec ~ 2.8hrs -likely over 3hrs- to complete.
A single thread starts out occupying around 763MiB. Memory usage grows along the calculation though.
After 1h 20min it's almost 1GiB:
top - 04:05:02 up 35 days, 5:18, 3 users, load average: 1.11, 1.11, 1.04
Tasks: 216 total, 2 running, 214 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.5 us, 0.0 sy, 0.0 ni, 87.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 15808.6 total, 695.3 free, 1960.8 used, 13152.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 13090.2 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
256940 admin 20 0 1566132 990.4m 1736 R 99.9 6.3 78:35.40 test_mtxmult_mp
After over 3hrs the memory footprint has grown to 1.3GiB:
top - 06:00:41 up 35 days, 7:14, 3 users, load average: 1.00, 1.00, 1.01
Tasks: 216 total, 2 running, 214 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.5 us, 0.0 sy, 0.0 ni, 87.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 15808.6 total, 364.8 free, 2291.2 used, 13152.6 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 12759.7 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
256940 admin 20 0 1566132 1.3g 1736 R 99.8 8.3 194:05.29 test_mtxmult_mp
Even further, after 4hrs
top - 14:17:57 up 35 days, 15:31, 3 users, load average: 1.00, 1.00, 1.00
Tasks: 218 total, 2 running, 216 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.5 us, 0.0 sy, 0.0 ni, 87.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 15808.6 total, 225.9 free, 2428.4 used, 13154.4 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 12622.6 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
259259 admin 20 0 1566132 1.4g 1700 R 99.9 9.2 240:55.60 test_mtxmult_mp
Somewhat later it has reached the threshold of 1.5GiB
top - 14:29:57 up 35 days, 15:43, 2 users, load average: 1.25, 1.17, 1.08
Tasks: 213 total, 3 running, 210 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.5 us, 0.0 sy, 0.0 ni, 87.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 15808.6 total, 196.4 free, 2458.1 used, 13154.1 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 12592.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
259259 admin 20 0 1566132 1.5g 1700 R 99.9 9.4 252:54.93 test_mtxmult_mp
Finally, the single-core computation finishes and the multi-core one kicks in:
top - 14:49:41 up 35 days, 16:03, 2 users, load average: 8.03, 6.22, 3.50
Tasks: 220 total, 9 running, 211 sleeping, 0 stopped, 0 zombie
%Cpu(s): 99.6 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.3 hi, 0.0 si, 0.0 st
MiB Mem : 15808.6 total, 159.3 free, 2518.2 used, 13131.1 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 12488.5 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
260600 admin 20 0 2347384 1.5g 6968 R 99.7 9.7 6:46.09 test_mtxmult_mp
259259 admin 20 0 2347384 1.5g 8272 R 99.7 9.7 272:35.76 test_mtxmult_mp
260597 admin 20 0 2347384 1.5g 4592 R 99.6 9.7 6:45.67 test_mtxmult_mp
260601 admin 20 0 2347384 1.5g 4064 R 99.6 9.7 6:45.89 test_mtxmult_mp
260603 admin 20 0 2347384 1.5g 7232 R 99.6 9.7 6:45.78 test_mtxmult_mp
260599 admin 20 0 2347384 1.5g 6968 R 99.6 9.7 6:45.71 test_mtxmult_mp
260598 admin 20 0 2347384 1.5g 4328 R 99.4 9.7 6:44.97 test_mtxmult_mp
260602 admin 20 0 2347384 1.5g 4328 R 99.3 9.7 6:45.12 test_mtxmult_mp
Notice how the RES column now shows a huge value right from the start of this new, multi-process computation. Remember that fork spawns a clone of our program at the point of the call. As we had just done the single-core computation, the parent had 1.5GiB allocated, and it is a snapshot of that which gets cloned. Hence this value.
(A more detailed account of memory usage follows in the Linux Memory Types section below.)
A few minutes before the calculation finishes, notice the values of the RSsh column, which end up on the scale of ~95MiB, 1/8 of the total size the matrix occupies. See below.
Linux Memory Types
A comment on Linux Memory Types is in order here, to clarify a bit what we are reading there.
From the linux top man page:

The following may help in interpreting process level memory values displayed as scalable columns and discussed under topic `3a. DESCRIPTIONS of Fields'.
%MEM - simply RES divided by total physical memory
CODE - the `pgms' portion of quadrant 3
DATA - the entire quadrant 1 portion of VIRT plus all explicit mmap file-backed pages of quadrant 3
RES - anything occupying physical memory which, beginning with Linux-4.5, is the sum of the following three fields:
RSan - quadrant 1 pages, which include any former quadrant 3 pages if modified
RSfd - quadrant 3 and quadrant 4 pages
RSsh - quadrant 2 pages
RSlk - subset of RES which cannot be swapped out (any quadrant)
SHR - subset of RES (excludes 1, includes all 2 & 4, some 3)
SWAP - potentially any quadrant except 4
USED - simply the sum of RES and SWAP
VIRT - everything in-use and/or reserved (all quadrants)
Note: Even though program images and shared libraries are considered private to a process, they will be accounted for as shared (SHR) by the kernel.
Thus it is
USED = RES + SWAP (0 here; not shown above)
RES = RSan + RSfd + RSsh
SWAP is the chunk of the process' memory that has been moved out to swap space because other tasks running in your computer requested too much memory.
RSan = Resident Anonymous Memory Size (KiB). Chunk of physical memory allocated to the process that is not backed by any file -it lives purely in RAM. This is a private (for the process' own eyes only) chunk of memory.
RSfd = Resident File-Backed Memory Size (KiB). Shared pages supporting program images and shared libraries, plus explicit file mappings, both private and shared.
RSsh = Resident Shared Memory Size (KiB). Explicitly shared anonymous shm*/mmap pages.
Then we also have
CODE = Chunk (KiB) of physical memory currently used by the executable code, a.k.a. Text Resident Set (TRS).
DATA = Data + Stack size (KiB). Memory reserved by the process, aka. Data Resident Set (DRS). It may not yet be mapped to physical (RES) memory. It's always included in VIRT though.
Simple Improvement
Therefore, the way we wrote our program, fork is effectively cloning the memory of the parent process. That's on us.
While at fork time 1.5GiB is cloned and accounted as resident memory (RSan) for each child, that value stays constant throughout the whole calculation.
The USED value, however, increases by roughly 100MiB. This corresponds to the increase in RSsh, the shared anonymous resident memory allocated to each child; it matches the size of the chunk of the result matrix each child must compute.
As we allocated shared memory for the resulting matrix (check the code for double* ab = mmap(...) ), the space for the result doesn't amount to an additional 760MiB for each process.
One way to optimize this calculation is to free the space occupied by the result of the single-process calculation. That would leave only the original matrix for cloning, ~760MiB.
Furthermore, we could avoid duplicating this data too: define it as shared memory, initialize its values in the parent, and right then make it read-only. I'll try to post measured results another day, but a minimal sketch of the idea would be:
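//romtx.c -- sketch of the idea; nrows/ncols are placeholder sizes
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(){
    size_t nrows = 1000, ncols = 1000; //hypothetical dimensions
    size_t len = nrows*ncols*sizeof(double);
    //allocate the input matrix in shared memory instead of calloc
    double* A = mmap(NULL,len,PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if( A == MAP_FAILED ){ perror("mmap"); return 1; }
    srand(1234567);
    for(size_t i = 0 ; i < nrows*ncols ; i++)
        A[i] = (10.0*rand())/RAND_MAX; //initialize in the parent
    //from here on the pages are read-only: children fork()ed after this point
    //share the very same physical pages and cannot even accidentally write to them
    if( mprotect(A,len,PROT_READ) == -1 ){ perror("mprotect"); return 1; }
    printf("A[0]=%.4f lives in shared, read-only memory now\n",A[0]);
    return 0;
}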
But even without going into that trouble, simply allocating the initial matrix in a shared memory space via mmap -instead of a regular calloc (see the code)- already yields some improvement. Check the following two top snapshots.
Both show the initial moments of a run with the 10,000*10,000 matrix and 8 threads, but skipping the single-thread calculation (option -ns).
The first corresponds to allocating the initial matrix via the calloc() library call. Its size of ~760MiB is effectively cloned as anonymous resident memory, RSan, for each of the processes.
However, when allocating the initial matrix as shared memory via the mmap() system call, things look more promising: the whole initial matrix is now accounted as RSsh, while the private RSan chunk is minimal!
COW in Action
Let's run this: ./test_mtxmult_mp-big 40000 40000 8 -ns. This is a square matrix of 40,000 rows and columns, hence 16 times larger than the previous example: roughly 11.9GiB.
First we'll do it allocating the initial matrix with a calloc call. That puts it under RSan and DATA.
If fork actually cloned and copied this memory for each of the 8 processes, the system would run out of memory, as it only has 16GiB of RAM.
However, when we run it, things do work. Furthermore, the physical used memory as given by top is 13368MiB, slightly above 12GiB. This is about 12400MiB more than the baseline (i.e., before starting the calculation). Notice the similarity with the memory required by the matrix itself.
This means that fork didn't really copy and duplicate all that data! It's only being read, so COW never needs to actually duplicate the pages -there is no write.
Incidentally, this shows that it's not meaningful to add up the RES values of different processes, as that sum does not amount to the actual, physical combined memory used by all of them.
Let's repeat this case, but allocating the initial matrix via an mmap call.
Again we see the shift from almost all the memory being private (above) to almost all being shared.
However, an analogous conclusion can be drawn: it's not meaningful to add up the values of USED for different processes; neither is it to add up the values of RSsh. They do not account for independent chunks of physical memory; we would be overcounting!
Additionally, notice that in this case the physical used memory reported by top is a mere 1158MiB, a bit over 1GiB; that's about 180MiB over the initial baseline, which in this case was 974MiB.
Can we account for that? I don't know how.
Finally, notice that the used buff/cache mem has increased in this case by roughly 12000MiB, again of the order of the size of the initial matrix.
Conclusions
This post provides a first hands-on introduction to concurrent programming. You can use it for small projects on most OSs, like OSX, Linux or FreeBSD -notably not Windows, which has no fork capabilities.
Being an introduction, it definitely doesn't cover everything needed for a practical guide to the topic.
For instance, passing information between parent and children could be done through pipes (very limited). We used a more sensible approach through memory sharing (shm, mmap, etc.).
A better alternative is to fine-tune the creation of the children through the clone family of system calls. These allow selectively sharing things like the heap or the stack.
When sharing memory among the parent and children, with read and write privileges for all, an important complication arises: race conditions. This means that, as a programmer, you must make sure that no two processes attempt to modify the same memory location at the same time, that no process forgets to release a lock in time -blocking a second process from using the resource- and so on.
Semaphores are the keyword for digging into this.
This is the typical day-to-day life when people say they are using pthreads! ;-)
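As a taste, here is a minimal sketch of a semaphore guarding a shared counter. I use POSIX unnamed semaphores (sem_init with pshared=1), which work on Linux; on macOS you'd need named semaphores via sem_open instead. Compile with -pthread (needed for sem_* on older glibc).

//semcounter.c -- sketch: a semaphore guarding a counter in shared memory
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <semaphore.h>

int main(){
    //both the lock and the counter live in shared anonymous memory
    struct shared { sem_t lock; long counter; };
    struct shared* sh = mmap(NULL,sizeof(struct shared),PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if( sh == MAP_FAILED ){ perror("mmap"); return 1; }
    sem_init(&sh->lock, 1 /*share between processes*/, 1);
    for( int i = 0 ; i < 4 ; i++){
        if( fork() == 0 ){
            for( int j = 0 ; j < 100000 ; j++){
                sem_wait(&sh->lock);  //enter the critical section
                sh->counter++;        //without the lock, increments would get lost
                sem_post(&sh->lock);  //leave the critical section
            }
            exit(0);
        }
    }
    while( wait(NULL) > 0 );
    printf("counter = %ld (expected 400000)\n",sh->counter);
    return 0;
}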
A high-performance, highly portable and scalable way of passing information among children and with the parent is through an API initially developed in academia that became a standard in some scenarios: the MPI (Message Passing Interface) library. It relies on processes. There are MPI bindings for many languages (C, C++, Fortran, Java, ...).