%% This is part of the OpTeX project, see http://petr.olsak.net/optex

\_codedecl \pdfunidef {PDFunicode strings for outlines <2024-12-18>} % preloaded in format

   \_doc -----------------------------
   \`\pdfunidef``\macro{<text>}` defines `\macro` as <text> converted to
   Big Endian UTF-16 and enclosed to \code{<>}. Example of usage:
   `\pdfunidef\infoauthor{Petr Olšák} \pdfinfo{/Author \infoauthor}`.\nl
   \^`\pdfunidef` does more things than only converting to hexadecimal PDF string.
   The <text> can be scanned in verbatim mode (it is true because \^`\_Xtoc`
   reads the <text> in verbatim mode). First `\edef` do
   `\_scantextokens\unexpanded` and second `\edef` expands the parameter
   according to current values on selected macros from `\_regoul`. Then
   \`\_removeoutmath` converts `..$x^2$..` to `..x^2..`, i.e removes dollars.
   Then \`\_removeoutbraces` converts `..{x}..` to `..x..`.
   Finally, the <text> is detokenized, and \^`\pdfstring` is applied.\nl
   Characters for quotes (and separators for quotes) are activated by first
   `\_scatextokens` and they are defined as the same non-active characters.
   But `\_regoul` can change this definition.
   \_cod -----------------------------

\_def\_pdfunidef#1#2{%
   \_begingroup
      \_catcodetable\_optexcatcodes \_adef"{"}\_adef'{'}%
      \_the\_regoul \_relax % \_regmacro alternatives of logos etc.
      \_ifx\_savedttchar\_undefined \_def#1{\_scantextokens{\_unexpanded{#2}}}%
      \_else \_lccode`\;=\_savedttchar \_lowercase{\_prepinverb#1;}{#2}\fi
      \_edef#1{#1}%
      \_escapechar=-1
      \_edef#1{#1\_empty}%
      \_escapechar=`\\
      \_ea\_edef \_ea#1\_ea{\_ea\_removeoutmath   #1$\_fin$}%  $x$ -> x
      \_ea\_edef \_ea#1\_ea{\_ea\_removeoutbraces #1{\_fin}}%  {x} -> x
      \_edef#1{\_detokenize\_ea{#1}}%
      \_ea
   \_endgroup
   \_ea\_edef\_ea#1\_ea{\_pdfstring\_ea{#1}}
}
\_def\_removeoutbraces #1#{#1\_removeoutbracesA}
\_def\_removeoutbracesA #1{\_ifx\_fin#1\_else #1\_ea\_removeoutbraces\_fi}
\_def\_removeoutmath #1$#2${#1\_ifx\_fin#2\_else #2\_ea\_removeoutmath\_fi}

   \_doc -----------------------------
   The \`\_prepinverb``<macro><separator>{<text>}`,
   e.g.\ `\_prepinverb\tmpb|{aaa |bbb| cccc |dd| ee}`
   does `\def\tmpb{<su>{aaa }bbb<su>{ cccc }dd<su>{ ee}}` where
   <su> is `\scantextokens\unexpanded`. It means that in-line verbatim
   are not argument of `\scantextoken`. First `\edef\tmpb` tokenizes again
   the <text> but not the parts which were in the the in-line verbatim.
   \_cod -----------------------------

\_def\_prepinverb#1#2#3{\_def#1{}%
   \_def\_dotmpb ##1#2##2{\_addto#1{\_scantextokens{\_unexpanded{##1}}}%
      \_ifx\_fin##2\_else\_ea\_dotmpbA\_ea##2\_fi}%
   \_def\_dotmpbA ##1#2{\_addto#1{##1}\_dotmpb}%
   \_dotmpb#3#2\_fin
}

   \_doc -----------------------------
   The \^`\regmacro` is used in order to set the values of macros
   `\em`, `\rm`, `\bf`, `\it`, `\bi`, `\tt`, `\/` and `~` to values usable in
   PDF outlines.
   \_cod -----------------------------

\_regmacro {}{}{\_let\em=\_empty \_let\rm=\_empty \_let\bf=\_empty
    \_let\it=\_empty \_let\bi=\_empty \_let\tt=\_empty \_let\/=\_empty
    \_let~=\_space
}
\public \pdfunidef ;

\_endcode % --------------------------------

There are only two encodings for PDF strings (used in PDFoutlines, PDFinfo,
etc.). The first one is PDFDocEncoding which is single-byte encoding, but it
misses most international characters.

The second encoding is Big Endian UTF-16 which is implemented in this file. It
encodes a single character in either two or four bytes.
This encoding is \TeX/-discomfortable because it looks like

\begtt
<FEFF 0043 0076 0069 010D 0065 006E 00ED 0020 006A 0065 0020 007A 00E1 0074
011B 017E 0020 0061 0020 0078 2208 D835DD44>
\endtt

This example shows a hexadecimal PDF string (enclosed in \code{<>} as opposed
to the literal PDF string enclosed in `()`). In these strings each byte is
represented by two hexadecimal characters (`0-9`, `A-F`). You can tell the
encoding is UTF-16BE, because it starts with \"Byte order mark" `FEFF`. Each
unicode character is then encoded in one or two byte pairs. The example string
corresponds to the text \"Cvičení je zátěž a ${\rm x} ∈ 𝕄$". Notice the 4 bytes
for the last character, $𝕄$. (Even the whitespace would be OK in a PDF file,
because it should be ignored by PDF viewers, but \LuaTeX\ doesn't allow it.)
This conversion is done by the expandable \"pseudoprimitive " \^`\pdfstring`.

\_endinput

2024-12-18 \_pdfhexstring removed, used \_pdfstring.
2021-02-08 \_octalprint -> \_hexprint
2020-03-12 Released