Philip Guo (Phil Guo, Philip J. Guo, Philip Jia Guo, pgbovine)

C Programming Tips

Introduction

This document is a collection of useful lessons about C programming that I have learned over the past few years. It is not meant to serve as a tutorial for the language (many good free tutorials exist online) or to nit-pick on syntax. I am by no means a C language expert, but I have used it quite extensively over the past few years to construct dynamic program analysis tools based on the Valgrind framework for my research. I have done most of my C development on a GNU/Linux system using tools such as GCC and GDB, so some of these tips will be specific to that platform. However, most tips are widely applicable to C development on any platform and may even apply to programming in other languages.

Should I even be programming in C?

This is the first question you should ask yourself before sitting down to start a new programming project. Why are you programming in C in the first place? For quick hacks and small projects, high-level interpreted languages like Python and Ruby allow you to get something up and running much faster with less fuss. For more structured software engineering-like work, languages like Java and C# offer better type checking, more natural support for data abstractions, and more powerful standard libraries. For web programming, there's PHP, XSLT, and loads of other acronyms. Given all of these other choices, why would anyone ever want to program in C nowadays?

Many people (myself included) program in C because they need to interface with libraries and other programs that are written in C. For all of its lack of safety, lack of support for high-level abstractions, and the horrendously gross things you can do with it, C is still one of the most widely used programming languages in the world. It is used to build all types of software ranging from low-level device drivers to graphically-intensive desktop applications. Some of the most sophisticated pieces of software (e.g., operating systems, web servers, compilers, networking tools) are written in C. Much of the enormous body of free and open-source software (e.g., the GNU project) is written in C.

If you want to build on top of these existing code bases (like I am now doing with Valgrind), or if you need low-level close-to-the-hardware functionality, then it is usually easier to program in C than to use another language. However, before starting to program in C, ask yourself whether it is the best language for the job that you are trying to accomplish. You should always choose the language that operates at the highest level of abstraction that is adequate for your task. Since C is a fairly low-level language, you should probably not use it unless the alternatives are all sub-optimal.

Catch bugs as early as possible during execution

The following tips address a common problem: a bug in a C program can go unnoticed for a long time between when it occurs and when an actual error surfaces. Oftentimes, the manifested error has little to do with the original bug that caused it, which can be extremely frustrating to debug. You need to be vigilant about ensuring that your bugs are caught as soon as possible during execution. The C compiler and runtime system will not help you out much; the most useful error message they can give you is the infamous Segmentation fault. Of course, you can always run your program through a debugger and get a backtrace to observe its state at the time of the crash, but by then it may be way too late to tell what caused the crash. You want to have your program crash as early as possible if there is a problem so that bugs do not propagate.

Use assert statements to document and enforce function pre-conditions and other assumptions

The basic unit of abstraction in most C programs is the function. Many non-trivial functions take pointers to data structures as arguments and often mutate them in sophisticated ways. In order to perform the intended task correctly, functions expect their arguments to have certain properties (e.g., this pointer is non-null, this integer is within a certain range). These are called the pre-conditions of the function. In addition, during certain points within a function (such as within a branch of an if statement), certain assumptions must hold true. You should always write down these assumptions as comments within your code.

However, a more powerful way to document these assumptions in addition to writing comments is to include them directly in the code as assert statements. An assert statement takes an expression, evaluates it, and if it is false, aborts the program with an error message stating where in the source code the assertion failed. This simple construct, when used aggressively, can help track down many bugs and also serve as valuable documentation. Use an assert statement whenever you make any non-trivial assumption about some part of your program. The immense power of an assert is that it allows the programmer to catch bugs early before they propagate to other parts of the program and cause weird crashes or memory corruption (which is very easy in C programs if you are not careful). It also serves as documentation that is often better than comments because it actually compiles and executes like the rest of your code. Don't ever feel hesitant to include assert statements because it might 'slow down' your code slightly (you will probably not notice the slowdowns); the benefits of peace-of-mind and improved bug-finding capabilities are far more valuable.
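As a minimal sketch of what this looks like in practice, here is a small function whose pre-conditions are written as assert statements. The function average() and its parameters are hypothetical names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the average of the first n elements of values.
   Pre-conditions: values is non-null and n is positive. */
double average(const int *values, size_t n) {
  assert(values != NULL); /* the caller must pass a real array */
  assert(n > 0);          /* otherwise we would divide by zero below */

  long sum = 0;
  for (size_t i = 0; i < n; i++) {
    sum += values[i];
  }
  return (double)sum / (double)n;
}
```

A call such as average(NULL, 3) aborts right at the first assert, with a message naming the source file and line, instead of crashing somewhere inside the loop.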

For more information, refer to my article: The benefits of programming with assertions (a.k.a. assert statements)

Use rep. checks to enforce data structure consistency

Non-trivial C programs often operate on data structures that contain integers, strings, and pointers to other data structures. There are certain properties called rep. (representation) invariants that must be true about a particular data structure in order for it to be 'well-formed' (i.e., to conform to its specs). Functions that operate on these data structures often assume that they are 'well-formed' or else the code that operates on them may be useless. Like function pre-conditions, programmers should write down representation invariants in comments. This may be trivial for simple data structures, but can be a difficult task for sophisticated ones which include pointers to other data structures.

A better idea is to actually encode these rep. invariants as a series of assert statements placed within a rep. check (representation check) function. Whenever it is convenient, insert calls to the rep. check functions for the appropriate data structures. Once a particular instance passes a rep. check call, you know, for the moment at least, that your assumptions about it hold true. You can think of a rep. check as an application of assert to protect assumptions about data structures rather than about functions. It is crucial to report errors in data structures as early as possible, because a corrupted data structure can continue working fine for a long time until it crashes in some bizarre way in some distant part of the program far from where it was initialized. The combination of applying assert statements on functions and data structures has helped me to catch countless bugs in my code that would have been much more difficult to debug otherwise.
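Here is a rough sketch of a rep. check, assuming a hypothetical fixed-capacity stack of integers. The IntStack type and the int_stack_* function names are invented for this example; the invariants are the ones stated in the comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stack of ints, invented for this illustration. */
typedef struct {
  int *items;      /* heap array holding the elements */
  size_t size;     /* number of elements currently stored */
  size_t capacity; /* number of elements that items can hold */
} IntStack;

/* Rep. check: aborts if any representation invariant is violated. */
void int_stack_rep_check(const IntStack *s) {
  assert(s != NULL);
  assert(s->capacity > 0);
  assert(s->items != NULL);
  assert(s->size <= s->capacity);
}

void int_stack_init(IntStack *s, size_t capacity) {
  assert(s != NULL);
  assert(capacity > 0);
  s->items = calloc(capacity, sizeof(*s->items));
  assert(s->items != NULL); /* sketch only: allocation failure is fatal */
  s->size = 0;
  s->capacity = capacity;
  int_stack_rep_check(s);
}

void int_stack_push(IntStack *s, int value) {
  int_stack_rep_check(s);        /* well-formed on entry */
  assert(s->size < s->capacity); /* pre-condition: there is room */
  s->items[s->size++] = value;
  int_stack_rep_check(s);        /* and still well-formed on exit */
}
```

Calling the rep. check at the start and end of every function that touches the structure means that a corrupted stack is reported at the first function that sees it, not at some distant crash.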

For more information, refer to my article: The benefits of object-oriented programming using class invariants (a.k.a. representation invariants)

Initialize all data

If you declare a local variable, it is not initialized to anything, and if you allocate a new block of heap memory using malloc(), that is also uninitialized. Uninitialized (garbage) data is never useful, so make sure that none of your data is ever uninitialized. Be vigilant about always writing int foo = 0 instead of writing int foo and worrying about it later (because you may forget to initialize it on some code path). Instead of malloc(), use calloc() to allocate a block of memory and initialize it to 0. Sure, it takes several more instructions to initialize data to 0, but we're not in the 1970s anymore! A minuscule gain in performance is never an excuse to increase the possibility of introducing nasty bugs into your program.
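A small sketch of both habits, using a hypothetical Account struct invented for this example. One caveat that I am adding here: calloc() sets every byte to zero, which gives integer 0 everywhere and, on common platforms (though the C standard does not promise it), null pointers and 0.0 as well. The = {0} initializer zeroes each member properly regardless of platform.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record type, invented for this illustration. */
typedef struct {
  int id;
  double balance;
  char *name;
} Account;

/* Returns a fully initialized account: = {0} sets every member to zero
   (and name to a null pointer), so no field is left as garbage. */
Account blank_account(void) {
  Account a = {0};
  return a;
}

/* Allocates count accounts with every byte set to zero.
   Returns NULL if the allocation fails; the caller must free() the result. */
Account *make_accounts(size_t count) {
  assert(count > 0);
  Account *accounts = calloc(count, sizeof(*accounts));
  return accounts;
}
```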

Don't let functions fail silently

Sometimes it is tempting to have a function return early without doing anything if a certain pre-condition isn't satisfied (e.g., with something like if (foo->bar < 0) return;). After all, the entire program shouldn't abort just because the function sees an input that is inappropriate, right? I disagree. I think that you should rarely have a function return without doing anything. Turn these conditions into asserts so that the function fails with a bang when it sees invalid input. Why are you even passing invalid inputs into the function in the first place? If many of your functions fail silently, then bugs can go unnoticed and surface at really bad times. In general, you need to keep the tightest net you can around your code to make sure that all bugs manifest themselves as early as possible.
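To make the contrast concrete, here is a sketch of the same function written both ways. The Foo struct and the two scale_* functions are hypothetical names chosen to match the foo->bar snippet above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct matching the foo->bar snippet in the text. */
typedef struct {
  int bar;
} Foo;

/* Silent version: invalid input is swallowed and the caller never finds out. */
void scale_silently(Foo *foo, int factor) {
  if (foo->bar < 0) return;
  foo->bar *= factor;
}

/* Loud version: invalid input aborts the program at the point of the bug. */
void scale_loudly(Foo *foo, int factor) {
  assert(foo != NULL);
  assert(foo->bar >= 0);
  foo->bar *= factor;
}
```

With scale_silently(), a negative bar is simply left unscaled and the program carries on with wrong data; with scale_loudly(), the same call stops the program and points at the offending line.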

Tips on pointers and dynamic memory allocation

Pointers are one of those things that are simple to define (a value that holds a memory address) but can cause beginner C programmers endless headaches. Pointers are unavoidable in non-trivial C programs because C only supports passing function parameters by value, so there is no way for a function to mutate its arguments except if you pass in pointers. In general, use pointers and dynamic memory allocation as little as you need to (favor local variables because the stack is automatically 'garbage collected' when a function exits), but keep these tips in mind when you do need to use them:

Pointers do not equal dynamic memory allocation

It is a common misconception that pointers are only used to refer to dynamically-allocated memory. A pointer is simply a value that holds a memory address. It could hold the address of a global variable, a local variable on the stack, or a dynamically-allocated value on the heap. Because values in the heap do not have names, the only way to refer to them is using pointers. However, many pointers point to global or local variables, whose addresses can be acquired using the & operator. So whenever you see a pointer, don't automatically assume that it refers to a dynamically-allocated value.
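The sketch below passes three pointers to the same function: one to a global, one to a local, and one to a heap value. All of the names (global_counter, increment(), pointer_demo()) are made up for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

int global_counter = 10; /* lives in static storage for the whole run */

/* Adds one to whatever int p points to, wherever that int lives. */
void increment(int *p) {
  assert(p != NULL);
  (*p)++;
}

/* Increments a global, a local, and a heap value; returns their sum. */
int pointer_demo(void) {
  int local_counter = 20;                            /* lives on the stack */
  int *heap_counter = malloc(sizeof(*heap_counter)); /* lives on the heap */
  assert(heap_counter != NULL);
  *heap_counter = 30;

  increment(&global_counter); /* address taken with & */
  increment(&local_counter);  /* address taken with & */
  increment(heap_counter);    /* already a pointer */

  int sum = global_counter + local_counter + *heap_counter;
  free(heap_counter); /* only the heap value needs to be freed */
  return sum;
}
```

Note that increment() cannot tell where its argument came from, which is exactly why you should not assume that a pointer refers to heap memory.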

Be aware of what calls allocate memory on the heap

The scourge of not having a garbage collector is that you have the burden of manually freeing all dynamically-allocated memory. In concept, manual memory management is really simple: Every call to malloc() of a memory block should be followed later by a matching call to free(). Of course, in practice this can be extremely hard to guarantee. The main reason for this difficulty is that the call to malloc() may reside in some completely unrelated part of your code or even in libraries. You often allocate memory somewhere and pass around a pointer to it to many different functions before you need to free it.

In your own code, make sure that you understand where every pointer comes from and whether it needs to be freed later (remember that pointers to global and stack areas never need to be freed). If you call a library function, you should look at its API and find out whether it dynamically allocates any memory that you may have to free later (e.g., strdup() for duplicating strings).
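One habit that may help is to write the ownership rule into the comment of every function that hands back heap memory. Here is a sketch built on strdup(); the wrapper upper_copy() is a hypothetical name, and the feature-test macro on the first line is my assumption about what a strict compiler mode needs to expose strdup() on a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L /* exposes strdup() on POSIX systems */
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Returns an upper-cased copy of s.
   Ownership: the result is heap-allocated by strdup(), so the CALLER
   is responsible for calling free() on it. */
char *upper_copy(const char *s) {
  assert(s != NULL);
  char *copy = strdup(s);
  assert(copy != NULL); /* sketch only: allocation failure is fatal */
  for (char *p = copy; *p != '\0'; p++) {
    *p = (char)toupper((unsigned char)*p);
  }
  return copy;
}
```

A caller would write char *u = upper_copy("valgrind"); use u, and then call free(u) exactly once.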

Always set a pointer to 0 after calling free() on it

Whenever you call free() to free the memory referred to by a pointer, always set the pointer to 0 right away. The data referred to by a freed pointer is not necessarily erased, so it may be possible later in your program to use that freed pointer to read that data back, although that is very dangerous (it is undefined behavior). After a block of memory has been freed, the C library can re-assign it to another pointer at any time, so the freed pointer should never be used to refer to that block. The best way to keep this guarantee is to set a pointer to 0 after calling free() on it.
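One possible way to make this a habit is to wrap the two steps in a macro so that you cannot do one without the other. FREE_AND_NULL is a hypothetical name, not something from the C library.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper macro: frees the block and nulls the pointer variable. */
#define FREE_AND_NULL(p) \
  do {                   \
    free(p);             \
    (p) = NULL;          \
  } while (0)

/* Allocates, uses, and releases a block; returns 1 if the pointer ends up null. */
int free_demo(void) {
  int *numbers = calloc(8, sizeof(*numbers));
  assert(numbers != NULL);
  numbers[0] = 42;

  FREE_AND_NULL(numbers);
  FREE_AND_NULL(numbers); /* harmless: free(NULL) is defined to do nothing */

  return numbers == NULL;
}
```

An accidental second free through a nulled pointer is harmless, and an accidental dereference of it will typically crash right away rather than quietly read stale data.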

Be very careful about pointer aliasing when calling free()

Before you call free() on a pointer, ask yourself whether there are any other pointers that refer to the same memory location. If so, either don't call free() or set all of those other pointers to 0 as well as the freed pointer. One of the worst kinds of C bugs (memory corruption) occurs when you have two or more pointers referring to one location (called aliasing) and you call free() on one of the pointers to free up that location. Even if you set that pointer to 0 immediately, there are still other pointers that refer to that location. Because the memory does not get re-assigned right away, the program can still use those other pointers to read back valid data. However, there is no guarantee of when that memory will get clobbered with new data, and when it does, your program will either crash (hope for that) or worse, nefariously do something incorrect and propagate a bug to some other part before surfacing it.
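Here is a sketch of the "set all of those other pointers to 0" option, assuming a hypothetical Holder struct where two holders are known to share one block. The names are invented for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical struct that keeps a pointer to a shared heap block. */
typedef struct {
  int *data;
} Holder;

/* Frees the block shared by both holders and clears every alias to it. */
void release_shared(Holder *a, Holder *b) {
  assert(a != NULL && b != NULL);
  assert(a->data == b->data); /* pre-condition: both holders alias one block */
  free(a->data);
  a->data = NULL;
  b->data = NULL; /* without this line, b->data would dangle */
}
```

This only works when you actually know every alias, which is a good argument for keeping the number of pointers to any one heap block as small as you can.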

Style tips

These tips are stylistic because they have no impact on your code's behavior. However, they can greatly help improve the organization and readability of your code, thus making it easier for you to find bugs.

Limit everything to the tightest possible scope.

When programming in any language, you should make sure that program modules are as isolated and self-contained as possible. Unfortunately, C doesn't provide any way to enforce modularity other than files. You need to use files to form strict module boundaries. Declare all functions as static (only visible within the file) unless they absolutely need to be called from functions in other files. Keep non-static functions to a minimum; the narrower the interface between different modules (files), the less you will hopefully have to debug. The same goes for global variables. Declare them static unless they need to be accessed by functions in other files. Global variables make it extremely difficult to reason about program behavior because they destroy locality (just ask proponents of functional programming). Local variables should be declared at the top of the smallest enclosing block where they will be used. Do not get lazy and declare all local variables at the top of a function, or even worse, re-use the same local variable for many different tasks. You are not in the 1970s anymore! You can afford to use those few extra bytes of stack space in exchange for more readable and maintainable code.
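As a sketch, imagine that the following is the entire contents of one source file. Only normalize_percent() and normalize_call_count() are visible to other files; the helper and the counter are static. All names here are hypothetical.

```c
#include <assert.h>

/* File-private state: no other file can read or write this variable. */
static int call_count = 0;

/* File-private helper: no other file can call this function. */
static int clamp(int value, int low, int high) {
  assert(low <= high);
  if (value < low) return low;
  if (value > high) return high;
  return value;
}

/* Part of this file's (narrow) public interface. */
int normalize_percent(int raw) {
  call_count++;
  int clamped = clamp(raw, 0, 100); /* declared right where it is needed */
  return clamped;
}

/* Read-only access to the counter, instead of exposing the variable itself. */
int normalize_call_count(void) {
  return call_count;
}
```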

Use enumerations instead of #define's or numeric literals

I won't go into a diatribe about the C preprocessor (it can be beneficial if used sparingly when there are no other easy alternatives), but don't use preprocessor #define statements when you can use an enumeration instead. For example,

#define V_TRUCK  1
#define V_CAR    2
#define V_BIKE   3

is bad because the compiler has no clue that these three constants are related to one another. After pre-processing, all the compiler sees are numeric literals (oh yeah, don't use those either!). A better approach is to use an enum

typedef enum {V_TRUCK = 1, V_CAR, V_BIKE} Vehicle;

because now the compiler knows that V_TRUCK, V_CAR, and V_BIKE all belong to the type Vehicle. C has fairly weak type checking in general, but it can at least type check enums and give you warnings, which is better than no checks if you use #define macros.
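One place where this pays off is in switch statements. In the sketch below, wheel_count() is a hypothetical function and the wheel numbers are arbitrary. Because the switch is over an enum and has no default case, GCC with -Wall warns if you later add a new Vehicle value and forget to handle it here.

```c
#include <assert.h>

typedef enum { V_TRUCK = 1, V_CAR, V_BIKE } Vehicle;

/* Hypothetical function: number of wheels for each kind of vehicle. */
int wheel_count(Vehicle v) {
  switch (v) {
    case V_TRUCK: return 6;
    case V_CAR:   return 4;
    case V_BIKE:  return 2;
  }
  /* Only reachable if v holds a value that is not a valid Vehicle. */
  assert(0 && "unknown Vehicle value");
  return 0;
}
```

With the #define version, the same switch would be over a plain int and the compiler would have no way to know which cases are missing.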

Never use a numeric literal when you can use a more meaningful expression

You should never use a numeric literal in your code unless it truly stands for a number and cannot be more easily expressed in some other expression. The most common case of this is the use of numbers to indicate variable sizes when calling memory allocator functions. For example, if you want to allocate an array of 4 integers (initialized to 0), the following 3 statements are equivalent on a platform where an int is 4 bytes (such as the 32-bit x86):

1. int* foo = (int*)calloc(4, 4);            // Huh? What does 4 mean?
2. int* foo = (int*)calloc(4, sizeof(int));  // good ...
3. int* foo = (int*)calloc(4, sizeof(*foo)); // but even better

Version 1 is bad because it uses the number 4 to stand for the size of an int when an expression like sizeof(int) in version 2 is clearer, less error-prone (what if you forget that an int is 4 bytes on an x86), and more portable (what if you switch to a different architecture?). However, I prefer version 3, which expresses the size in terms of the actual pointer variable foo. sizeof(*foo) returns the size of whatever foo refers to, which in this case, is an int. This is the most robust solution because if you later change foo to a different type, then you don't have to change the sizeof expression at all. Remember that sizeof is resolved at compile-time, so using it does not incur any run-time overhead over simply using a number. Also, don't be afraid to add the results of a few sizeof expressions together in your code rather than trying to do the math yourself. Constant folding performed by the compiler will resolve those operations and produce one number in the object code.
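Here is a short sketch of both points. In make_counters(), foo has been changed from int* to long* without touching the sizeof expression. In message_size(), two sizeof expressions are combined rather than doing the math by hand. The Header struct and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* foo used to be an int*; changing it to long* needed no edit to the sizeof. */
long *make_counters(void) {
  long *foo = calloc(4, sizeof(*foo));
  return foo; /* NULL on failure; the caller must free() it */
}

/* Hypothetical message layout: a fixed header followed by count ints. */
typedef struct {
  int kind;
  size_t count;
} Header;

/* Number of bytes needed for one message with count payload ints. */
size_t message_size(size_t count) {
  return sizeof(Header) + count * sizeof(int);
}
```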

Debugging tips

Your (non-trivial) program will always have bugs, so it is never too early to learn to debug. Here are some useful strategies:

Use print statements (with endlines) to home in on a problem, then switch into a debugger

Many programmers scoff at print statements as something that only newbies use. However, I think that they are very valuable in giving you an overall feel of program execution. Remember to always end a debugging print statement with an endline character ('\n') because many implementations of printf() buffer their output but are forced to write the buffer out when they encounter an endline character (typically when the output is going to a terminal). If you don't use endlines in your print statements, then sometimes you may think that a particular part of your code hasn't been reached before your program crashed (since nothing was printed out there), when in fact it was reached but your statement never printed in time due to buffering.
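If you want to be extra careful, one option (my suggestion, not a requirement) is a small tracing macro that always appends the endline, writes to stderr, and flushes explicitly. TRACE and checked_divide() are hypothetical names for this sketch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical tracing macro: prints the file, line, and message to stderr
   with a trailing endline, then flushes so nothing is stuck in a buffer. */
#define TRACE(msg)                                               \
  do {                                                           \
    fprintf(stderr, "[%s:%d] %s\n", __FILE__, __LINE__, (msg));  \
    fflush(stderr);                                              \
  } while (0)

int checked_divide(int numerator, int denominator) {
  TRACE("entering checked_divide");
  assert(denominator != 0);
  int result = numerator / denominator;
  TRACE("leaving checked_divide");
  return result;
}
```

If the program dies between the two TRACE calls, the first message is already on the screen, so you know how far execution got.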

If your program crashes, you can run it through a debugger and re-construct its execution state at the time of the crash, but often you want to know what events led up to the crash. You could step through your program from the beginning of execution, but that is extremely tedious and slow. What I do is sprinkle print statements throughout my program, observe the output to see what gets printed before the crash, and repeat until I gain a good sense of where my problem may be. Then I can fire up a debugger and set a breakpoint in the vicinity of that area.

Write complex conditional breakpoint conditions in your source code

Let's say that when you're trying to debug, you only want to stop at a particular function when some complex condition has been met. Ordinarily, you would set a conditional breakpoint in the debugger. You probably need to set the conditions again every time you restart the debugger, which can be annoying. Instead, a better idea is to set the conditional in the program itself like so:

100  void foo(int a, int b, int c) {
101    ... blah ...

120    if (((a % 3) == 0) || 
121        ((b < (c * 2)) && ((c + a) < 0))) {
122      printf("BREAK!!!\n");
123    }

150    ... bleh ...
151  }

Now all you have to do is to set a regular (unconditional) breakpoint on line 122 (the line with the print statement). This line should only be executed if your condition passes. You now have a conditional breakpoint without messing with the debugger at all. This can often be a much better alternative than setting complex conditions within the debugger because you can include arbitrary function calls in those conditions, which may be difficult or impossible to do within the debugger.

Debug memory corruption errors using watchpoints

Use watchpoints in a debugger like GDB to diagnose memory corruption errors, one of the nastiest and hardest-to-find types of bugs in a memory-unsafe language like C. Memory corruption can occur when some data structure in your program retains a pointer to a region of memory that is freed without its knowledge (via an aliased pointer), and then some other part of your program re-allocates that memory to be used for another purpose. You should suspect memory corruption whenever you find that there is some bug where a value is valid at a particular time but jumbled at another time, and that the time when the corruption occurs varies across different runs.

To fix these bugs, you first need to find the line of code that causes the value to be clobbered with junk. To do so, go into GDB and step to a line of the code where the program accesses the data that will later be corrupted. Set a watchpoint on the expression that you used to access the data using watch foo, where foo is the expression that you want GDB to watch. Now continue to execute your program normally, and GDB will pause it when the contents of foo first change and give you the line of code that caused the change. Oftentimes, this is enough information for you to realize what you need to do to fix your bug. Note that all a watchpoint does is stop the program when the value of the expression it is watching changes, so to prevent false alarms, you want to set the watchpoint at a time when you no longer expect that data to change ... but of course, it will change due to the memory corruption, which is precisely the change that you want the watchpoint to catch.

Testing tips

Adopt a testing strategy early in development and stick to it

Don't wait until you've finished your entire project to start testing it. Start as early as you can, and build up your test suite as you develop. Testing is tedious, but it is also one of the best ways to uncover bugs. Everybody has their own views on testing, so my only advice here is that you shouldn't ignore it. If you can automate your tests, then you will be encouraged to run them more often during development, which will help you to catch more bugs. As your program grows more complicated, it becomes harder and harder to determine whether a change in one part will have weird side-effects on some other part of the program. If you have a good test suite, running it after your change and seeing that there were no diffs can give you some confidence (although no guarantees) that your change did not have un-intended side-effects.

Example testing strategy: System tests, golden files, and lots of asserts

Testing is like flossing your teeth. Everyone knows that it's a good thing, but most people don't do it often enough. I think that many people are turned off by testing because they've been taught in their classes that every single function should have a unit test associated with it. People remember the tedious hours they had to spend on their school programming assignments writing boring and trivial unit tests for functions that, well, perform boring and trivial tasks.

I am going to propose a different testing strategy (one that is more applicable to some domains than others, of course): Don't write unit tests at all. Instead, only write system tests that verify the correctness of either your entire end-to-end system or, if your system is too large, a sub-system module. I think that it's especially difficult to write unit tests for C programs because what you are trying to test are functions that often take pointers as arguments. What are those pointers? Well, they point to data structures that themselves contain pointers, etc... It takes a lot of infrastructure code to properly initialize these data structures before they can be passed into the functions such that they meet the pre-conditions in the first place. It takes even more work to verify that the data structures have been modified in the proper way.

All of this overhead of setting up data structures can be eliminated when writing system tests. For a particular function, its calling function usually sets up the data structure state properly (or the caller of the caller does, etc...). All a system test needs as input is the input to the system itself. All a system test needs to verify as output is the output of the system. All of the overhead of setting up intermediate levels of infrastructure can be eliminated. The best aspect of a system test is that when it passes, you know that your entire system (or sub-system) works, not just a small portion of it. Being able to run and test a fully-functional (albeit buggy) system is a whole lot more encouraging and useful than unit testing small parts of it.

For example, for the program analysis tools that I am building, the input is a C program and the output is a text file trace of the program's data structures during execution. I can manually verify that a trace is correct, and then use it as my 'golden file' to compare against the traces produced by subsequent runs of my tool (i.e. using diff). If there are diffs, then I can immediately see what differs, which can help direct me to a particular part of the program that is troubling. I can either try to debug it directly or write some more specialized tests (maybe unit tests) that specifically exercise that part of the code. As development progresses, I write more and more system tests, which also double as regression tests. I only need to manually verify the output once and then save it as a 'golden file.' Any future diffs from that 'golden file' are flagged as anomalies.

System tests are especially powerful when combined with heavy use of assert statements throughout your program to enforce conditions such as data structure rep. invariants and function pre-conditions. As you write more system tests, you will achieve better code coverage, and if something goes wrong, chances are that some assertion will fail. You can view the tests as ways to 'tease' the assert statements by executing them over and over again with different states. The asserts will catch the vast majority of the aforementioned structural problems, and manually verifying the results of the system tests will catch semantic problems (i.e. is the program doing the right thing for this particular input?) that cannot easily be checked by assert statements.

Misc. tips

Make your compiler yell at you and don't ignore its warnings

Turn on your compiler warning levels as high as you feel comfortable doing (if you use GCC, see the various -W options in the manual, most notably -Wall for turning on most warnings), and then look at all the warnings and try to address them as though they were errors. The C standard doesn't provide very tight constraints for what code can legally compile, so you can get away with running fairly atrocious code. For example, once I forgot to include a return statement in a function that returns an integer, thus causing the program to return whatever junk value was in the EAX register. This caused some subtle bug that took me a long time to detect. Even if a compiler warning seems senseless, look at that line of code because it might be the indication of some other error elsewhere. Because you are programming in C, you already lose lots of the compile-time safety checks available in higher-level languages, so you must be vigilant to ensure that you don't allow code that has blatant syntactic bugs to compile and run.

Don't re-invent the wheel (or common data structures)

Unlike Java, C has no authoritative 'standard library' of general-purpose data structures. Each platform has its own C library (e.g., GNU libc), but it is often difficult to find convenient data structures (e.g., vectors, trees, hash tables) that programmers in other languages take for granted. This lack of standardization means that there are tons of different implementations of common data structures. If you ever peek inside the source code of any moderate to large C project, you will see that the authors have usually implemented their own data structure library with binary trees, hash tables, strings (remember that there is no primitive string type in C), etc.

In general, don't write your own common data structures unless you have no other choice (with the exception of simple things like an array-based stack or a linked list that don't take much effort). If you are interfacing with other code, look inside it to see what data structures it uses. Or search the Internet for libraries like GLib, which provides several common data structures. Many C programmers create their own versions of common data structures to supposedly optimize for performance. Unless you are working on speed-critical applications (like an OS), performance is never an excuse for not using a well-maintained, well-debugged data structure library.

Created: 2005-11-26
Last modified: 2008-11-16