Provided by: libtickit-dev_0.4.3-2_amd64

NAME

       tickit_utf8_count, tickit_utf8_countmore - count characters in Unicode strings

SYNOPSIS

       #include <tickit.h>

       typedef struct {
           size_t bytes;
           int    codepoints;
           int    graphemes;
           int    columns;
       } TickitStringPos;

       size_t tickit_utf8_count(const char *str, TickitStringPos *pos,
           const TickitStringPos *limit);
       size_t tickit_utf8_countmore(const char *str, TickitStringPos *pos,
           const TickitStringPos *limit);

       size_t tickit_utf8_ncount(const char *str, size_t len,
           TickitStringPos *pos, const TickitStringPos *limit);
       size_t tickit_utf8_ncountmore(const char *str, size_t len,
           TickitStringPos *pos, const TickitStringPos *limit);

       Link with -ltickit.

DESCRIPTION

       tickit_utf8_count()  counts characters in the given Unicode string, which must be in UTF-8
       encoding. It starts at the beginning of the string and counts forward over codepoints  and
       graphemes,  incrementing  the  counters  in  pos  until it reaches a limit. It will not go
       further than any of the limits given by the limit structure (where the value -1 indicates
       no limit of that type). It will never split a codepoint in the middle of a UTF-8 sequence,
       nor will it split a grapheme between its codepoints; it is  therefore  possible  that  the
       function  returns  before  any of the limits have been reached, if the next whole grapheme
       would involve going past at least one of the specified limits. The function will also stop
       when it reaches the end of str. It returns the total number of bytes it has counted over.

       The  bytes  member  counts UTF-8 bytes which encode individual codepoints. For example the
       Unicode character U+00E9 is encoded by two bytes 0xc3, 0xa9; it would increment the  bytes
       counter by 2 and the codepoints counter by 1.
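
       As an illustration of how these two counters relate, the following minimal sketch counts
       bytes and codepoints in a well-formed UTF-8 string by treating every byte that is not a
       continuation byte (bit pattern 10xxxxxx) as the start of a codepoint. The names SketchCount
       and sketch_count() are hypothetical and are not part of libtickit, and this is not the
       library's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type for this sketch only; not part of libtickit. */
typedef struct {
    size_t bytes;
    int    codepoints;
} SketchCount;

/* Count bytes and codepoints in a NUL-terminated string.
 * Assumption: the input is well-formed UTF-8. */
static SketchCount sketch_count(const char *str)
{
    assert(str != NULL);
    SketchCount c = { 0, 0 };
    for (const unsigned char *p = (const unsigned char *)str; *p; p++) {
        c.bytes++;
        /* Continuation bytes look like 10xxxxxx; any other byte
         * begins a new codepoint. */
        if ((*p & 0xC0) != 0x80)
            c.codepoints++;
    }
    return c;
}
```

       Calling sketch_count("\xc3\xa9") gives bytes 2 and codepoints 1, matching the U+00E9
       example above.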

       The codepoints member counts individual Unicode codepoints.

       The  graphemes  member  counts  whole  composed  graphical  clusters  of codepoints, where
       combining accents which count as individual codepoints do not count as separate graphemes.
       For example, the codepoint sequence U+0065 U+0301 would increment the codepoints counter by
       2 and the graphemes counter by 1.
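
       The difference between codepoints and graphemes can be sketched as follows. This is a
       deliberately simplified model that treats only the Combining Diacritical Marks block
       (U+0300 to U+036F) as combining; real grapheme segmentation covers far more cases. The
       names SketchClusters, sketch_decode() and sketch_clusters() are hypothetical and are not
       part of libtickit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for this sketch only; not part of libtickit. */
typedef struct {
    int codepoints;
    int graphemes;
} SketchClusters;

/* Decode one UTF-8 sequence starting at p and store its byte length in *len.
 * Assumption: the input is well-formed UTF-8. */
static uint32_t sketch_decode(const unsigned char *p, size_t *len)
{
    if (p[0] < 0x80) {
        *len = 1;
        return p[0];
    }
    if ((p[0] & 0xE0) == 0xC0) {
        *len = 2;
        return ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
    }
    if ((p[0] & 0xF0) == 0xE0) {
        *len = 3;
        return ((uint32_t)(p[0] & 0x0F) << 12) |
               ((uint32_t)(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
    }
    *len = 4;
    return ((uint32_t)(p[0] & 0x07) << 18) | ((uint32_t)(p[1] & 0x3F) << 12) |
           ((uint32_t)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
}

/* Count codepoints and (simplified) graphemes in a NUL-terminated string. */
static SketchClusters sketch_clusters(const char *str)
{
    assert(str != NULL);
    SketchClusters c = { 0, 0 };
    const unsigned char *p = (const unsigned char *)str;
    while (*p) {
        size_t len;
        uint32_t cp = sketch_decode(p, &len);
        c.codepoints++;
        /* Simplification: only U+0300..U+036F extends the previous grapheme. */
        int combining = (cp >= 0x0300 && cp <= 0x036F);
        if (!combining || c.graphemes == 0)
            c.graphemes++;
        p += len;
    }
    return c;
}
```

       For the sequence U+0065 U+0301, written in C as "e\xcc\x81", sketch_clusters() reports 2
       codepoints and 1 grapheme.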

       The columns member counts the number of screen columns consumed  by  the  graphemes.  Most
       graphemes consume only 1 column, but some are defined in Unicode to consume 2.

       tickit_utf8_countmore()  is  similar to tickit_utf8_count() except it will not zero any of
       the counters before it starts. It can continue counting where a previous call finished. In
       particular,  it  will assume that it is starting at the beginning of a UTF-8 sequence that
       begins a new grapheme; it will not check these facts and  the  behavior  is  undefined  if
       these assumptions do not hold. It will begin at the offset given by pos.bytes.

       The  tickit_utf8_ncount()  and  tickit_utf8_ncountmore()  variants are similar except that
       they read no more than len bytes from  the  string  and  do  not  require  it  to  be  NUL
       terminated.  They will still stop at a NUL byte if one is found before len bytes have been
       read.

       These functions all stop immediately if they encounter any C0 or C1 control byte other than
       NUL, returning the value -1; because the return type is size_t, callers should compare the
       result against (size_t)-1. In this circumstance, the pos structure will still be updated
       with the progress so far.

USAGE

       Typically, these functions would be used in either of two ways.

       When given a value in limit.bytes (or no  limit  and  simply  using  string  termination),
       tickit_utf8_count()  will  yield the width of the given string in terminal columns, in the
       pos.columns field.

       When given a value in limit.columns, tickit_utf8_count() will yield the number of bytes of
       that string that will consume the given space on the terminal.
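
       The column-limited case, including the rule that a grapheme is never split, can be
       modelled with the self-contained sketch below. It does not call libtickit and is not the
       library's implementation: the names SketchPos, sketch_decode(), sketch_is_combining(),
       sketch_width() and sketch_fit_columns() are hypothetical, only U+0300 to U+036F is treated
       as combining, and the double-width ranges are a rough approximation chosen for
       illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type for this sketch only; not part of libtickit. */
typedef struct {
    size_t bytes;
    int    columns;
} SketchPos;

/* Decode one UTF-8 sequence at p; assumes well-formed input. */
static uint32_t sketch_decode(const unsigned char *p, size_t *len)
{
    if (p[0] < 0x80) { *len = 1; return p[0]; }
    if ((p[0] & 0xE0) == 0xC0) {
        *len = 2;
        return ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
    }
    if ((p[0] & 0xF0) == 0xE0) {
        *len = 3;
        return ((uint32_t)(p[0] & 0x0F) << 12) |
               ((uint32_t)(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
    }
    *len = 4;
    return ((uint32_t)(p[0] & 0x07) << 18) | ((uint32_t)(p[1] & 0x3F) << 12) |
           ((uint32_t)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
}

/* Simplification: only the Combining Diacritical Marks block. */
static int sketch_is_combining(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Rough width model: 0 for combining marks, 2 for a few East Asian
 * ranges, 1 otherwise. */
static int sketch_width(uint32_t cp)
{
    if (sketch_is_combining(cp))
        return 0;
    if ((cp >= 0x1100 && cp <= 0x115F) || (cp >= 0x2E80 && cp <= 0xA4CF) ||
        (cp >= 0xAC00 && cp <= 0xD7A3) || (cp >= 0xFF00 && cp <= 0xFF60))
        return 2;
    return 1;
}

/* Advance over whole graphemes while the column total stays within
 * max_columns; -1 means no column limit. */
static SketchPos sketch_fit_columns(const char *str, int max_columns)
{
    assert(str != NULL);
    SketchPos pos = { 0, 0 };
    const unsigned char *p = (const unsigned char *)str;
    while (*p) {
        size_t len;
        uint32_t cp = sketch_decode(p, &len);
        size_t glen = len;              /* bytes in this grapheme */
        int gcols = sketch_width(cp);   /* columns for this grapheme */
        /* Absorb any combining marks that follow the base codepoint. */
        while (p[glen]) {
            cp = sketch_decode(p + glen, &len);
            if (!sketch_is_combining(cp))
                break;
            glen += len;
        }
        /* Stop before a grapheme that would exceed the limit. */
        if (max_columns != -1 && pos.columns + gcols > max_columns)
            break;
        pos.bytes += glen;
        pos.columns += gcols;
        p += glen;
    }
    return pos;
}
```

       With the string "a" U+4E2D "b" and a limit of 2 columns, the sketch stops after 1 byte and
       1 column, because the double-width character would take the total to 3. This mirrors the
       documented case where counting ends before a limit is actually reached.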

RETURN VALUE

       tickit_utf8_count() and tickit_utf8_countmore() return the number of bytes they have
       counted over in this call, or -1 if they encounter a C0 or C1 control byte other than NUL.

SEE ALSO

       tickit_stringpos_zero(3),    tickit_stringpos_limit_bytes(3),     tickit_utf8_mbswidth(3),
       tickit(7)

                                                                             TICKIT_UTF8_COUNT(3)