2021-08-02
When you compile code such as:
#include <stddef.h>
#include <stdio.h>

int main(void)
{
    size_t sz = 0;
    printf("%lu\n", sz);
}
your compiler may or may not warn about the format string being
wrong. Depending on your system, your compiler may want you to use a
“%llu” format string instead, because a size_t is a long long unsigned
there. On your off-the-shelf AMD64 Linux, it’s just a long unsigned.
Why is that, and how can we work around this?
The integer types of C are
char, short, int, long, long long,
as well as their unsigned variants and _Bool (as long as I didn’t miss
any). The signed ones have the following format strings (swap the d for
a u to get the unsigned counterparts):
%hhd, %hd, %d, %ld, %lld
Everything else is a typedef (or a non-integer type), that is, there exists some header with:
typedef long off_t;
and another with
typedef unsigned long size_t;
etc., for beauty’s sake. These typedefs aren’t part of the core
language of C but, in some cases like size_t, part of the C standard,
which encompasses the C standard library (with printf, fread, …) as
well. Other typedefs such as off_t aren’t part of C, but of the POSIX
interface [1].
The C library function printf recognizes additional format strings
for such types, such as
%zu (size_t), %p (void*), %s (char*), ...
However, the exact type behind size_t (e.g., unsigned long or
unsigned long long) is left to the implementation. The standard simply
demands that it be big enough to hold the size of any object. Given
different hardware and operating systems, this may vary quite a lot.
Why should a size_t be 64 bit on a small 16 bit CPU?
Depending on the platform, we have different typedefs for size_t.
This means that on AMD64 Linux %zu may in fact be equivalent to %lu,
and the compiler won’t warn you if you use the latter. However, as soon
as you switch platforms, this code may break: passing an argument whose
type does not match the format string is undefined behavior. That’s why
we use “%zu” instead, as the compiler recognizes it as correct for
size_t on any platform.
However, some types, such as off_t, aren’t part of the C standard,
and as such there is no dedicated format string available for off_t.
This means we need to find out how big off_t is in order for this to
work (POSIX at least requires it to be a signed type, so that’s fixed).
To query your system, you can ask the getconf(1) utility to return the available programming environments for a given configuration in a specific standard. The configurations of interest here are documented in c99(1):
$ cat test-off-t.sh
#!/bin/sh
for name in _POSIX_V7_ILP32_OFF32 \
            _POSIX_V7_ILP32_OFFBIG \
            _POSIX_V7_LP64_OFF64 \
            _POSIX_V7_LPBIG_OFFBIG; do
    printf "%s: %s\n" "$name" "$(getconf -v POSIX.1-2017 "$name")"
done
$ ./test-off-t.sh
_POSIX_V7_ILP32_OFF32: undefined
_POSIX_V7_ILP32_OFFBIG: undefined
_POSIX_V7_LP64_OFF64: 1
_POSIX_V7_LPBIG_OFFBIG: undefined
$
So, on my Linux there’s only one available configuration, and that is
a 64 bit off_t. Supposing my system supported both OFF32 and OFF64, I
could query the specific compiler flags to switch to one or the other
using
$ getconf -v POSIX.1-2017 POSIX_V7_ILP32_OFF32_CFLAGS
and
$ getconf -v POSIX.1-2017 POSIX_V7_LP64_OFF64_CFLAGS
There are not many systems that are that configurable out there, with Solaris probably being one of the few exceptions.
However, most of the time we don’t want to change our system’s size of
off_t for a specific piece of code, but simply query it.
While using getconf for this is fine, we can actually do this within C
already, since the <unistd.h> header defines the same information as
macros. We can write:
#include <unistd.h>
#include <sys/types.h>
#include <stdio.h>

/* POSIX defines an environment macro as -1 if it is not supported. */
#if defined(_POSIX_V7_ILP32_OFF32) && _POSIX_V7_ILP32_OFF32 != -1
// 32-bit = int = %d
# define PRIofft "d"
#elif defined(_POSIX_V7_LP64_OFF64) && _POSIX_V7_LP64_OFF64 != -1
// 64-bit = long = %ld
# define PRIofft "ld"
#else
# error "Unsupported Programming Environment"
#endif

int main(void)
{
    off_t n = 0;
    printf("%" PRIofft "\n", n);
}
We use the trick that adjacent string literals (everything enclosed in
"") are concatenated by the compiler right after preprocessing. So, if
our off_t is 32 bits wide, the macro PRIofft expands to "d", which
yields
printf("%" "d" "\n", n);
which in turn will become
printf("%d\n", n);
But hey, on some platforms an int is just 16 bit and not 32 bit (think
of 16-bit DOS and Windows compilers), so you cannot use %d for a 32 bit
off_t there! This is true; the C standard doesn’t specify that an int
is 32 bit either, it just demands that it can hold all values in
[-32767,+32767].
Luckily, C also has the header <stdint.h>, providing us with
fixed-width types like int32_t, and <inttypes.h>, providing macros
like PRId32 alongside, which expand to whatever format string we need
to print a 32 bit integer.
We can now write:
#include <unistd.h>
#include <sys/types.h>
#include <stdio.h>
#include <inttypes.h>

/* POSIX defines an environment macro as -1 if it is not supported. */
#if defined(_POSIX_V7_ILP32_OFF32) && _POSIX_V7_ILP32_OFF32 != -1
# define PRIofft PRId32 // 32-bit int
#elif defined(_POSIX_V7_LP64_OFF64) && _POSIX_V7_LP64_OFF64 != -1
# define PRIofft PRId64 // 64-bit int
#else
# error "Unsupported Programming Environment"
#endif

int main(void)
{
    off_t n = 0;
    printf("%" PRIofft "\n", n);
}
A bit simpler, you could also just cast the value to uintmax_t or
intmax_t respectively and print it with %ju or %jd. A similar method
was traditionally used before C99 came around, introducing fixed-width
types and [u]intmax_t.
Before the introduction of these types, long and unsigned long (or
long long int and long long unsigned, where the compiler already
offered them; they only became standard with C99) could reasonably be
assumed to be the largest integer types available, sans
platform-specific extensions [2], so it was quite safe to cast any
variable to these and print it that way. In the rare case that a
platform defined, say, off_t to be larger than a long long int, one
would have to work around this using classic preprocessor macros.
Other languages such as Java simply chose to have an int be the same
width on every system. This has obvious upsides, but also some
drawbacks. A CPU that simply cannot provide integers that big can’t be
used with this language [3]. Further, with C, an int is usually chosen
to be of a size that’s efficient to use on the given system. With a
fixed-width int this isn’t possible anymore.
Even more, C doesn’t even demand that a byte must be 8 bit, which is
highly useful for those people programming low-level audio DSPs (or old
IBM mainframes). In fact, C doesn’t guarantee much more than a char
being one byte (however many bits that is, but at least 8 [4]), and
that a short must be at least as big as a char, and so on.
POSIX goes a bit further by including all of the C standard, but also
demanding absolutely ridiculous things like CHAR_BIT = 8 (while also
providing us with open, read, write, getaddrinfo, …).
But even with POSIX, as we can see, many gaps are left to be filled to allow for flexibility in the implementation. By now, many of the knobs that POSIX allowed to be configured have more or less converged to a few sane (or not so sane) de facto “standards”, and many programming languages nowadays primarily use fixed-width types, as our many different CPU architectures (AMD64, PPC64, AARCH64, RISC-V64, …) have mostly agreed on some good behavior of integer widths, so maybe the need for non-fixed-width integers has indeed gone.
The Legacy, though, will live on forever.
[1] In fact, any type ending with _t is either C or POSIX, or someone
messed up, since POSIX reserves all type names ending with _t. If you
plan to create your own type but want to stay POSIX compatible, don’t
use _t.
[2] Yes, intmax_t can be larger than a long long int, and, in theory,
off_t can be as well; it’s just unlikely.
[3] Well, actually it can (just like Haskell can provide
infinite-width Integers), but it also will be infinitely slow.
[4] If you happen to feel a big desire to find out how many bits are
in one byte on your platform, you can look at the CHAR_BIT macro from
<limits.h>.