Resaltado de sintaxis para mayor golf!

14

Los golfistas

Juntos, nos hemos unido para producir código conciso, funcionalmente hermoso y más feo que el Fantasma de la Ópera de la novela original.

Ha llegado el momento de devolver la belleza al mundo de la programación. Con color. De una manera concisa, funcionalmente hermosa es más fea que el Fantasma de la Ópera de la novela original.

Vamos a codificar un marcador de sintaxis colorido. En la menor cantidad de código posible.

Recibirá a través de un archivo de entrada o Stdin un archivo C válido. El archivo C utilizará la convención de línea que elija y solo contendrá caracteres ASCII 32-126. Debe convertirlo en un archivo HTML que se muestre correctamente al menos en Chrome, que muestre el código fuente con resaltado de sintaxis. La salida puede estar en un archivo o en Stdout.

Debe resaltar:

  • Todas las cadenas y caracteres (incluidos los caracteres de comillas) en verde (# 00FF00). Las cadenas pueden contener caracteres escapados.

  • Todas las palabras C reservadas en azul (# 0000FF).

  • Todos los comentarios en amarillo (# FFFF00).

  • Todas las directivas de preprocesador C en rosa (# FF00FF).

La salida cuando se muestra en Chrome debe:

  • Estar en una fuente de ancho fijo

  • Mostrar nuevas líneas donde aparecieron en la fuente original

  • Reproduzca con precisión los espacios en blanco. Un carácter de tabulación debe tomarse como 4 espacios.

Bonos

  • x 0.9 si incluye números de línea. Los números de línea deben poder alcanzar al menos 99999. Toda la fuente debe estar alineada, por lo tanto, el código fuente con números de línea más pequeños debe comenzar en la misma posición que el código fuente con números de línea más altos.

  • x 0.8 si el fondo de cada línea se alterna entre gris claro (# C0C0C0) y blanco (#FFFFFF)

  • x 0.9 si su código fuente está escrito en C y puede formatearse correctamente.

Puntuación

Este es el código de golf. Su puntaje es la cantidad de bytes de su código fuente multiplicado por cualquier bonificación. El ganador es el golfista con la puntuación más baja.

lochok
fuente
1
@Charles el comportamiento estándar de los IDE más grandes es que los comentarios no se resaltarán más, lo que significa que los comentarios permanecen como comentarios. no he estado trabajando mucho con las directivas de preprocesador, pero la wikipedia aquí tampoco resalta más el código ... también las cadenas son cadenas, no necesitan más evaluación ...
Vogel612
1
Confirmar lo que Vogel612 dice que es el comportamiento correcto
lochok
2
"válido al menos en Chrome" o "se muestra correctamente al menos en Chrome"?
John Dvorak
3
@Trimsty Espera un segundo, soy un idiota. xD No importa.
cjfaure
44
¡Los peores colores de todos!
Greg

Respuestas:

8

Perl 769 caracteres * 0.9 * 0.8 = 554

Probablemente todavía se deben hacer algunas mejoras en algunas de las expresiones regulares, ¡pero está llegando lentamente!

$_=join"",<>;$s="<tt class";$c="</tt>";$d=counter;$e=color;s/\t/    /g;s!<!&lt;!g;s!>!&gt;!g;s!^#.+(?=$|
)!$s=d>$&$c!gm;s!//.+!$s=c>$&$c!g;s|(['"]).*?(?<!\\)(\\\\)*\1|($h=$&)=~s!/!&#47;!g;"$s=s>$h$c"|smeg;s!/\*.*?\*/!$s=c>$&$c!smg;s!\b(_Packed|(au|go)to|break|c(ase|har|onst|ontinue)|d(efault|o|ouble)|e(lse|num|xtern)|f(loat|or)|if|int|long|re(gister|turn)|short|(un)?signed|s(izeof|tatic|truct|witch)|typedef|union|vo(id|latile)|while)\b!$s=r>$&$c!g;s!=(\w)>.+?$c!join"$c
$s=$1>",split$/,$&!smeg;s!
!<tr><td>!g;print"<style>body{font:10px monospace;$d-reset:n}td{white-space:pre}tr:nth-child(even){background:#c0c0c0}tr:before{$d-increment:n;content:$d(n)}.d{$e:#f0f}.r{$e:#00f}.c{$e:#ff0}.s{$e:#0f0}tt tt{$e:inherit!important}</style><table cellspacing=0><tr><td>$_"

Versión ligeramente menos ofuscada con comentarios:

$_=join"",<>; # slurp file
$s="<tt class"; # used later - use <tt/> instead of <span/>, fewer chars!
$c="</tt>";
$d=counter;
$e=color;
s/\t/    /g; # convert tabs to spaces
s!<!&lt;!g; # htmlentity < and >
s!>!&gt;!g;
s!^#.+(?=$|\n)!$s=d>$&$c!gm; # directives
s!//.+!$s=c>$&$c!g; # inline comments
s|(['"]).*?(?<!\\)(\\\\)*\1|($h=$&)=~s!/!&#47;!g;"$s=s>$h$c"|smeg; # strings, might have 0 length - thanks @Einacio; work-around string that contain /* by converting them to HTML entities
s!/\*.*?\*/!$s=c>$&$c!smg; # multi-line comments
s!\b(_Packed|(au|go)to|break|c(ase|har|onst|ontinue)|d(efault|o|ouble)|e(lse|num|xtern)|f(loat|or)|if|int|long|re(gister|turn)|short|(un)?signed|s(izeof|tatic|truct|witch)|typedef|union|vo(id|latile)|while)\b!$s=r>$&$c!g; # reserved words, don't optimise this too much! - thanks @bwoebi
s!=(\w)>.+?$c!join"$c
$s=$1>",split$/,$&!smeg; # any multi-line string/comment, ensure <span/>s are repeated
s!\n!<tr><td>!g; # strip newlines, replace with <tr><td>, don't need </td> _or_ </tr> - thanks @xfix!
print"<style>
body{font:10px monospace;$d-reset:n} /* init counter */
td{white-space:pre} /* preserve whitespace */
tr:nth-child(odd){background:#c0c0c0} /* alternating rows */
tr:before{$d-increment:n;content:$d(n)} /* place counter */
.d{$e:#f0f} /* highlights */
.r{$e:#00f}
.c{$e:#ff0}
.s{$e:#0f0}
tt tt{$e:inherit!important} /* ignore reserved words/comments in strings */
</style><table cellspacing=0><tr><td>$_"

Ahora resalta con éxito la entrada de @ xfix.

Tomó prestada la idea de abandonar </tr>la entrada de @ xfix, ¡gracias!

Ejemplo de salida para la solución de @ xfix .

Dom Hastings
fuente
1
Su solución no puede resaltar mi solución correctamente, porque no escapa al marcado HTML y no muestra los espacios en blanco correctamente.
Konrad Borowski el
1
En HTML5, </tr>y </td>son completamente opcionales, así que simplemente los ignoré.
Konrad Borowski
1
En realidad, podría guardar algunos caracteres utilizando a veces el nombre completo de la palabra clave en regex. Por ejemplo, if|intes un char menos que i(f|nt). O d(efault|o|ouble)es otro char menos que d(efault|o(uble)?).
bwoebi
1
en el violín, '\\' en la línea 61 no está marcado como cadena. ¿Hay un error del ejemplo o falla un caso límite?
Einacio
1
El código seguirá funcionando si mueve el <style>bloque al final y omite la etiqueta final. Entonces también puede omitir el último }del estilo. Claro, es completamente inválido, ¡pero funciona en Chrome!
user2428118
7

C - 1605 1200 caracteres * 0.9 * 0.8 * 0.9 = 777 caracteres

Definitivamente demasiado largo, pero lo que sea. 264 utilizado por la lista de palabras clave en sí. La versión larga de un revestimiento. No usa asignaciones de memoria, por lo que el uso de memoria es muy bajo (y todo es global, por lo que la pila no se usa realmente). HTML de muestra en JSFiddle . En mi opinión, el soporte de comentarios fue lo más complejo en el código.

char*k[]={"auto","break","case","char","const","continue","default","do","double","else","enum","extern","float","for","goto","if","int","long","register","return","short","signed","sizeof","static","struct","switch","typedef","union","unsigned","void","volatile","while"},b[9];c;p;e;p;l=1;q;s;i;main(){printf("<style>tr:nth-child(2n){background:#C0C0C0}</style><table style=font-family:monospace;white-space:pre-wrap><tr><td>1<td>");while(~(c=getchar())){if(!e&&q){if(q==c){printf("%c</span>",c);q=0;continue;}}else if(isalpha(c)&&p<8){b[p++]=c;continue;}else if(b[0]){for(i=0;i<32;i++){s!=2&&!strcmp(k[i],b)&&(printf("<span style=color:#00F>%s</span>",b),b[0]=0);}printf("%s",b);memset(b,0,9);}p=0;switch(c){case'<':printf("&lt;");goto e;case'&':printf("&amp;");goto e;case 92:putchar(c);e^=1;goto e;case'/':q||1==s?putchar('/'):3==s?(printf("*/"),s=0):(s=1);break;case'*':if(s==1){s=2;printf("<span style=color:#FF0>/*");}else if(s==2){s=3;}else{goto d;}break;case 39:case'"':e||q||(q=c,printf("<span style=color:#0F0>"));e=0;goto d;case 10:l+=1;printf("<tr><td>%d<td>",l);if(p&&e){case'#':e=0;q||(p=1,printf("<span style=color:#F0F>"));}else{p=0;}default:d:10!=c&&putchar(c);e:s=s/2*2;}}puts(b);}

Y la versión más larga (que es tan legible como un programa real, aparte de algunos trucos de golf de código que no pensé que podría aplicar fácilmente mientras jugaba el programa.

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>
#define ARRAY_SIZE(array) (sizeof(array) / sizeof *(array))
/* Sample comment. */
int main(void) {
    const char *keywords[] = {
        "auto",     "break",    "case",     "char",     "const",
        "continue", "default",  "do",       "double",   "else",
        "enum",     "extern",   "float",    "for",      "goto",
        "if",       "int",      "long",     "register", "return",
        "short",    "signed",   "sizeof",   "static",   "struct",
        "switch",   "typedef",  "union",    "unsigned", "void",
        "volatile", "while",
    };
    int character;
    int preprocessor = 0;
    int escape = 0;
    char buffer[9] = {0};
    int pos = 0;
    int line = 1;
    int quote = 0;
    int comment_state = 0;
    printf("<style>tr:nth-child(2n){background:#C0C0C0}</style><table style=font-family:monospace;white-space:pre-wrap><tr><td>1<td>");
    while ((character = getchar()) != EOF) {
        if (!escape && quote) {
            if (quote == character) {
                printf("%c</span>", character);
                quote = 0;
                continue;
            }
        }
        else if (isalpha(character) && pos < 8) {
            buffer[pos] = character;
            pos += 1;
            continue;
        }
        else if (buffer[0]) {
            int i;
            for (i = 0; i < ARRAY_SIZE(keywords); i++) {
                if (comment_state != 2 && strcmp(keywords[i], buffer) == 0) {
                    printf("<span style=color:#00F>%s</span>", buffer);
                    buffer[0] = 0;
                }
            }
            printf("%s", buffer);
            memset(buffer, 0, 9);
        }
        pos = 0;
        switch (character) {
        case '<':
            printf("&lt;");
            goto e;
        case '&':
            printf("&amp;");
            goto e;
        case '\\':
            putchar(character);
            escape ^= 1;
            goto e;
        case '/':
            if (quote || comment_state == 1) {
                putchar('/');
            }
            else if (comment_state == 3) {
                printf("*/");
                comment_state = 0;
            }
            else {
                comment_state = 1;
            }
            break;
        case '*':
            if (comment_state == 1) {
                comment_state = 2;
                printf("<span style=color:#FF0>/*");
            }
            else if (comment_state == 2) {
                comment_state = 3;
            }
            else {
                goto d;
            }
            break;
        case '\'':
        case '"':
            if (!escape && !quote) {
                quote = character;
                printf("<span style=color:#0F0>");
            }
            escape = 0;
            goto d;
        case '\n':
            line += 1;
            printf("<tr><td>%d<td>", line);

            /* Execute next only if conditions match. */
            if (preprocessor && escape) {
        case '#':
                escape = 0;
                if (!quote) {
                    preprocessor = 1;
                    printf("<span style=color:#F0F>");
                }
            }
            else {
                preprocessor = 0;
            }
            /* fallthru */
        default:
        d:
            if (character != '\n') putchar(character);
        e:
            comment_state = comment_state / 2 * 2;
        }
    }
    printf("%s</table>", buffer);
}
Konrad Borowski
fuente
¡La salida se ve fantástica!
lochok
@lochok: Y ahora se acorta a 1200 bytes. Esto todavía es más largo que la solución Perl, pero ahora la brecha es menor.
Konrad Borowski
1
Su código no parece funcionar con comentarios de varias líneas.
nickguletskii
3

PHP 606 bytes × 0.9 × 0.8 = 436

<style>li:nth-child(odd){background:#f5f5f5}pre{tab-size:4;-moz-tab-size:4}</style><pre><ol><li><?php $p='preg_match';preg_match_all('_\w+|("|\')(\\\\?.)*?\1|#(.(?!/[/*]))*|//.*|(?s)/\*.*?\*/|.+?_',stream_get_contents(STDIN),$m);foreach($m[0]as$t)echo'</font><font color=#',$t[0]=='#'?'d0d':($p('_^/[/*]_',$t)?'bb0':($p('/"|\'/',$t)?'0d0':($p('/^(auto|break|(cas|continu|doubl|els|volatil|whil)e|char|(cons|defaul|floa|in|shor|struc)t|do|enum|extern|for|goto|if|long|register|return|sizeof|static|switch|typedef|union|(un|)signed|void)$/',$t)?'00f':0))),'>'.preg_replace("/\r?\n/",'<li>',htmlentities($t));

Formateado:

<style>li:nth-child(odd){background:#f5f5f5}pre{tab-size:4;-moz-tab-size:4}</style>
<pre><ol><li><?php
$p='preg_match';
preg_match_all('_\w+|("|\')(\\\\?.)*?\1|#(.(?!/[/*]))*|//.*|(?s)/\*.*?\*/|.+?_',
    stream_get_contents(STDIN),$m);
foreach($m[0]as$t)
    echo'</font><font color=#',
        $t[0]=='#'?'d0d':(
        $p('_^/[/*]_',$t)?'bb0':(
        $p('/"|\'/',$t)?'0d0':(
        $p('/^(auto|break|(cas|continu|doubl|els|volatil|whil)e|char|'.
        '(cons|defaul|floa|in|shor|struc)t|do|enum|extern|for|goto|if|long|register|'.
        'return|sizeof|static|switch|typedef|union|(un|)signed|void)$/',$t)?'00f':
        0))),
        '>'.preg_replace("/\r?\n/",'<li>',htmlentities($t));
  • Lee desde stdin y escribe en stdout.

  • Las terminaciones de línea aceptadas son \ n y \ r \ n.

  • Hace números de línea y alternancia de color de línea.

  • Utilicé colores ligeramente diferentes para poder mirarlo, aunque no de una manera que afecte el recuento de bytes.

  • No tengo Chrome para probarlo, aunque está bien en Firefox.

Boann
fuente
2

C ++ - 5067 bytes 4612 * 0.9 * 0.8 = 3320 (* 0.9 = 2988 si poder formatearse cuenta - está escrito en C ++)

Me doy cuenta de que esto es más grande que las soluciones ya presentadas aquí, pero he decidido publicar esto de todos modos porque comencé a trabajar en mi versión antes de que se publicara la solución C de xfix.

  • Funciona con comentarios multilínea.
  • Produce HTML que tiene errores (pero se muestra correctamente en Chrome)
  • Se lee en input.c y produce output.html

La mitad de esto es la gran variedad de palabras clave C y C ++.

#include <iostream>
#include <string>
#include <fstream>
#include <sstream>
#include <locale>
#define B break
#define Z(a,b)if(s[i]==a){z=b;o+=OP+p(s[i]);i++;B;}
#define V(a,b)if(e(s,i,a)){z=b;o+=OC+p(s.substr(i,2));i+=2;B;}
#define N(a) if(s[i]=='\n'){a;o+=nl();i++;B;}
#define Q(a)if(e(s,i,"\\")){o+=p(s.substr(i,2));i+=2;B;}if(s[i]==a){z=0;o+=p(s[i])+CL;i++;B;}
#define C case
using namespace std;string k[]={"__abstract","__alignof","_Alignas","_alignof","and","and_eq","__asm","_asm","asm","__assume","_assume","auto","__based","_based","bitand","bitor","bool","_Bool","__box","break","__builtin_alignof","_builtin_alignof","__builtin_isfloat","case","catch","__cdecl","_cdecl","_Complex","cdecl","char","class","__compileBreak","_compileBreak","compl","const","const_cast","continue","__declspec","_declspec","default","__delegate","delete","do","double","dynamic_cast","else","enum","__event","__except","_except","explicit","__export","_export","extern","false","__far","_far","far","__far16","_far16","__fastcall","_fastcall","__feacpBreak","_feacpBreak","__finally","_finally","float","for","__forceinline","_forceinline","__fortran","_fortran","fortran","friend","_Generic","__gc","goto","__hook","__huge","_huge","huge","_Imaginary","__identifier","if","__if_exists","__if_not_exists","__inline","_inline","inline","int","__int128","__int16","_int16","__int32","_int32","__int64","_int64","__int8","_int8","__interface","__leave","_leave","long","__multiple_inheritance","_multiple_inheritance","mutable","namespace","__near","_near","near","new","_Noreturn","__nodefault","__nogc","__nontemporal","not","not_eq","__nounwind","__novtordisp","_novtordisp","operator","or","or_eq","__pascal","_pascal","pascal","__pin","__pragma","_pragma","private","__probability","__property","protected","__ptr32","_ptr32","__ptr64","_ptr64","public","__raise","register","reinterpret_cast","restrict","__restrict","__resume","return","__sealed","__serializable","_serializable","short","signed","__single_inheritance","_single_inheritance","sizeof","static","static_cast","_Static_assert","__stdcall","_stdcall","struct","__super","switch","__sysapi","__syscall","_syscall","template","this","__thiscall","_thiscall","throw","_Thread_local","__transient","_transient","true","__try","_try","try","__try_cast","typedef","typeid","typename","__typeof","__unaligned","__unhook","union","unsigned","using","__uuidof","_uuidof","__value","virtual","__virtual_inheritance","_virtual_inheritance","void","volatile","__w64","_w64","__wchar_t","wchar_t","while","xor","xor_eq"};string OS="<font color=\"#00FF00\">";string OC="<font color=\"#FFFF00\">";string OK="<font color=\"#0000FF\">";string OP="<font color=\"#FF00FF\">";string CL="</font>";string NL[]={ "<li class=\"li l1\">","<li class=\"li l2\">" };bool lo=1;string nl() {lo=!lo;return "</li>"+NL[lo];}bool r(char c,string s) {for (size_t i=0; i<s.size(); i++)if (c==s[i])return 0;return 0;}bool is(string s,int i) {return !(i<0||i>=s.size())&&((s[i]=='_')||isalpha(s[i]));}bool ic(string s,int i){return !(i<0||i>=s.size())&&(is(s,i)||('0'<=s[i]&&s[i]<='9'));}bool e(string a,int s,string b) {return !(a.size()-s<b.size())&&a.substr(s,b.size())==b;}string p(char c) {switch (c) {C  '&':return "&amp;";C  '\"':return "&quot;";C  '\'':return "&apos;";C  '<':return "&lt;";C  '>':return "&gt;";}stringstream s;s<<c;return s.str();}string p(string s) {string ans="";for (size_t i=0; i<s.size(); i++) {ans+=p(s[i]);}return ans;}string h(string s) {int z=0;size_t i=0;string o ="<html><body><style type=\"text/css\">.l{list-style-type: decimal;margin-top:0;margin-bottom:0;} .li{display:list-item;word-wrap:B-word;} .l1{background-color:#FFFFFF;} .l2{background-color:#EEEEEE;} .cd{white-space:pre;}</style><code class=\"cd\"><ul class=\"l\">";o+=NL[1];for (; i<s.size();) {switch (z) {C 0:{Z('#',6)Z('"',3)Z('\'',2)V("//",4)V("/*",5)N()for (size_t j=0; j<201; j++)if (e(s,i,k[j])&&!ic(s,i+k[j].size())) {o+=OK+p(k[j])+CL;i+=k[j].size();B;}if (is(s,i)) {z=7;o+=p(s[i]);i++;if (i+1==s.size()||!ic(s,i+1)) {z=0;}B;}o+=p(s[i]);i++;B;}o+=p(s[i]);i++;B;C  2:Q('\'')C  3:Q('"')C 4:{N(z=0)o+=p(s[i]); i++;B;}C 5:{if (e(s,i,"*/")) {z=0;o+=p(s.substr(i,2))+CL;i+=2;B;}N()o+=p(s[i]);i++;B;}C 6:{if (s[i]=='\n') {int j=i-1;for (; j>=0&&r(s[j],"\n\t "); j--);if (j<0||s[j] != '\\') {z=0;o+=CL+nl();i++;B;}o+=nl();i++;B;}o+=p(s[i]);i++;B;}C 7:{if (i+1==s.size()||!ic(s,i+1)) {z=0;}o+=p(s[i]);i++;B;}}}o+="</ul></code>";return o;}int main() {ifstream i("input.c");ofstream o("output.html");string cCode((istreambuf_iterator<char>(i)),istreambuf_iterator<char>());o<<h(cCode)<<endl;}

Versión legible:

#include <iostream>
#include <string>
#include <fstream>
#include <sstream>
using namespace std;

//The 201 keywords from C and C++. Not sure if all of them are listed here!
const size_t NUMBER_OF_KEYWORDS = 201;
string keywords[] = { "__abstract", "__alignof", "_Alignas", "_alignof", "and",
        "and_eq", "__asm", "_asm", "asm", "__assume", "_assume", "auto",
        "__based", "_based", "bitand", "bitor", "bool", "_Bool", "__box",
        "break", "__builtin_alignof", "_builtin_alignof", "__builtin_isfloat",
        "case", "catch", "__cdecl", "_cdecl", "_Complex", "cdecl", "char",
        "class", "__compileBreak", "_compileBreak", "compl", "const",
        "const_cast", "continue", "__declspec", "_declspec", "default",
        "__delegate", "delete", "do", "double", "dynamic_cast", "else", "enum",
        "__event", "__except", "_except", "explicit", "__export", "_export",
        "extern", "false", "__far", "_far", "far", "__far16", "_far16",
        "__fastcall", "_fastcall", "__feacpBreak", "_feacpBreak", "__finally",
        "_finally", "float", "for", "__forceinline", "_forceinline",
        "__fortran", "_fortran", "fortran", "friend", "_Generic", "__gc",
        "goto", "__hook", "__huge", "_huge", "huge", "_Imaginary",
        "__identifier", "if", "__if_exists", "__if_not_exists", "__inline",
        "_inline", "inline", "int", "__int128", "__int16", "_int16", "__int32",
        "_int32", "__int64", "_int64", "__int8", "_int8", "__interface",
        "__leave", "_leave", "long", "__multiple_inheritance",
        "_multiple_inheritance", "mutable", "namespace", "__near", "_near",
        "near", "new", "_Noreturn", "__nodefault", "__nogc", "__nontemporal",
        "not", "not_eq", "__nounwind", "__novtordisp", "_novtordisp",
        "operator", "or", "or_eq", "__pascal", "_pascal", "pascal", "__pin",
        "__pragma", "_pragma", "private", "__probability", "__property",
        "protected", "__ptr32", "_ptr32", "__ptr64", "_ptr64", "public",
        "__raise", "register", "reinterpret_cast", "restrict", "__restrict",
        "__resume", "return", "__sealed", "__serializable", "_serializable",
        "short", "signed", "__single_inheritance", "_single_inheritance",
        "sizeof", "static", "static_cast", "_Static_assert", "__stdcall",
        "_stdcall", "struct", "__super", "switch", "__sysapi", "__syscall",
        "_syscall", "template", "this", "__thiscall", "_thiscall", "throw",
        "_Thread_local", "__transient", "_transient", "true", "__try", "_try",
        "try", "__try_cast", "typedef", "typeid", "typename", "__typeof",
        "__unaligned", "__unhook", "union", "unsigned", "using", "__uuidof",
        "_uuidof", "__value", "virtual", "__virtual_inheritance",
        "_virtual_inheritance", "void", "volatile", "__w64", "_w64",
        "__wchar_t", "wchar_t", "while", "xor", "xor_eq" };

// Different states
const int NONE = 0;
const int WHITESPACE = 1;
const int CHAR_UNCLOSED = 2;
const int STRING_UNCLOSED = 3;
const int LINE_COMMENT_UNCLOSED = 4;
const int MULTILINE_COMMENT_UNCLOSED = 5;
const int PREPROCESSOR_UNCLOSED = 6;
const int IDENTIFIER = 7;

//Different elements
const string OPEN_STRING = "<font color=\"#00FF00\">";
const string CLOSE_STRING = "</font>";
const string OPEN_COMMENT = "<font color=\"#FFFF00\">";
const string CLOSE_COMMENT = "</font>";
const string OPEN_KEYWORD = "<font color=\"#0000FF\">";
const string CLOSE_KEYWORD = "</font>";
const string OPEN_PREPROCESSOR = "<font color=\"#FF00FF\">";
const string CLOSE_PREPROCESSOR = "</font>";

//Alternating background
const string NEW_LINE[] = { "<li class=\"li l1\">", "<li class=\"li l2\">" };
bool lineOdd = true;
string getNewLineHTML() {
    lineOdd = !lineOdd;
    return "</li>" + NEW_LINE[lineOdd];
}

//Check if the character is in the string chars
bool inRange(char c, string chars) {
    for (size_t i = 0; i < chars.size(); i++)
        if (c == chars[i])
            return true;
    return false;
}

//Check if the character is the start of an identifier
bool isIdentifierStart(string input, int i) {
    if (i < 0 || i >= input.size())
        return false;
    return (input[i] == '_') || ('a' <= input[i] && input[i] <= 'z')
            || ('A' <= input[i] && input[i] <= 'Z');
}
//Check if the character is the continuation of an identifier
bool isIdentifierCont(string input, int i) {
    if (i < 0 || i >= input.size())
        return false;
    return ('0' <= input[i] && input[i] <= '9') || isIdentifierStart(input, i);
}
//Check if a[start + i] == b[i], i<b.size()
bool eqRange(string a, int start, string b) {
    if (a.size() - start < b.size())
        return false;
    return a.substr(start, b.size()) == b;
}
//Escape the sourcecode for HTML
string escape(char c) {
    switch (c) {
    case '&':
        return "&amp;";
    case '\"':
        return "&quot;";
    case '\'':
        return "&apos;";
    case '<':
        return "&lt;";
    case '>':
        return "&gt;";
    }
    //Is there a better way to do this?
    stringstream strm;
    strm << c;
    return strm.str();
}
string escape(string str) {
    string ans = "";
    for (size_t i = 0; i < str.size(); i++) {
        ans += escape(str[i]);
    }
    return ans;
}
string highlight(string input) {
    //The current state
    int state = NONE;
    //The current position
    size_t i = 0;
    //Styles
    string output =
            "<html><body><style type=\"text/css\">.l{list-style-type: decimal; margin-top: 0; margin-bottom: 0;} .li{ display: list-item; word-wrap: break-word;} .l1{background-color: #FFFFFF;} .l2{background-color: #EEEEEE;} .cd{white-space: pre;}</style><code class=\"cd\"><ul class=\"l\">";
    output += NEW_LINE[1];
    for (; i < input.size();) {
        switch (state) {
        case NONE: {
            if (input[i] == '#') { //Start a preprocessor statement
                state = PREPROCESSOR_UNCLOSED;
                output += OPEN_PREPROCESSOR + escape(input[i]);
                i++;
                break;
            }
            if (input[i] == '"') { //Start a string
                state = STRING_UNCLOSED;
                output += OPEN_STRING + escape(input[i]);
                i++;
                break;
            }
            if (input[i] == '\'') { //Start a character
                state = CHAR_UNCLOSED;
                output += OPEN_STRING + escape(input[i]);
                i++;
                break;
            }
            if (eqRange(input, i, "//")) { //Start a single line comment
                state = LINE_COMMENT_UNCLOSED;
                output += OPEN_COMMENT + escape(input.substr(i, 2));
                i += 2;
                break;
            }
            if (eqRange(input, i, "/*")) { //Start a multi-line comment
                state = MULTILINE_COMMENT_UNCLOSED;
                output += OPEN_COMMENT + escape(input.substr(i, 2));
                i += 2;
                break;
            }
            if (input[i] == '\n') { //New lines are special!
                output += getNewLineHTML();
                i++;
                break;
            }
            for (size_t j = 0; j < NUMBER_OF_KEYWORDS; j++) //Iterate through keywords
                if (eqRange(input, i, keywords[j])
                        && !isIdentifierCont(input, i + keywords[j].size())) { // The keyword can't be a prefix of an identifier, so test for that
                    output += OPEN_KEYWORD + escape(keywords[j])
                            + CLOSE_KEYWORD;
                    i += keywords[j].size();
                    break;
                }
            //Treat identifiers separately because we need to separate identifiers from keywords.
            if (isIdentifierStart(input, i)) {
                state = IDENTIFIER;
                output += escape(input[i]);
                i++;
                //If the next character is not a part of the identifier, go to the NONE state
                if (i + 1 == input.size() || !isIdentifierCont(input, i + 1)) {
                    state = NONE;
                }
                break;
            }
            //Other characters
            output += escape(input[i]);
            i++;
            break;
        }
        case CHAR_UNCLOSED: {
            if (eqRange(input, i, "\\")) { //Treat escape sequences inside quotes
                output += escape(input.substr(i, 2));
                i += 2;
                break;
            }
            if (input[i] == '\'') { //Close quote
                state = NONE;
                output += escape(input[i]) + CLOSE_STRING;
                i++;
                break;
            }
            output += escape(input[i]); //Other characters go into the literal
            i++;
            break;
        }
        case STRING_UNCLOSED: {
            if (eqRange(input, i, "\\")) { //Treat escape sequences inside quotes
                output += escape(input.substr(i, 2));
                i += 2;
                break;
            }
            if (input[i] == '"') { //Close quote
                state = NONE;
                output += escape(input[i]) + CLOSE_STRING;
                i++;
                break;
            }
            output += escape(input[i]); //Other characters go into the literal
            i++;
            break;
        }
        case LINE_COMMENT_UNCLOSED: {
            if (input[i] == '\n') { //Close comment with new line
                state = NONE;
                output += CLOSE_COMMENT + getNewLineHTML();
                i++;
                break;
            }
            output += escape(input[i]); //Comment body
            i++;
            break;
        }
        case MULTILINE_COMMENT_UNCLOSED: {
            if (eqRange(input, i, "*/")) { //Close multiline comment
                state = NONE;
                output += escape(input.substr(i, 2)) + CLOSE_COMMENT;
                i += 2;
                break;
            }
            if (input[i] == '\n') { //New lines are special!
                output += getNewLineHTML();
                i++;
                break;
            }
            output += escape(input[i]); //Comment body
            i++;
            break;
        }
        case PREPROCESSOR_UNCLOSED: {
            if (input[i] == '\n') { //Close preprocessor statement or go to next line
                int j = i - 1;
                for (; j >= 0 && inRange(input[j], "\n\t "); j--)
                    //Seek las non-whitespace character
                    ;
                if (j < 0 || input[j] != '\\') { //Check if the last non-whitespace character is a backslash
                    state = NONE; //... If it isn't, close the preprocessor statement
                    output += CLOSE_PREPROCESSOR + getNewLineHTML();
                    i++;
                    break;
                }
                output += getNewLineHTML(); //... If it is, we need to extend the preprocessor statement to the next line
                i++;
                break;
            }
            output += escape(input[i]);
            i++;
            break;
        }
        case IDENTIFIER: {
            //If the next character is not a part of the identifier, go to the NONE state
            if (i + 1 == input.size() || !isIdentifierCont(input, i + 1)) {
                state = NONE;
            }
            output += escape(input[i]);
            i++;
            break;
        }
        }
    }
    output += "</ul></code>";
    return output;
}
int main() {
    ifstream input("input.c");
    ofstream output("output.html");
    std::string cCode((std::istreambuf_iterator<char>(input)),
            std::istreambuf_iterator<char>());
    output << highlight(cCode) << endl;
}
nickguletskii
fuente
¿Es realmente necesario resaltar las palabras clave del compilador? No son estándar. Pero si realmente lo desea, suponga que todo lo que comienza con __(dos guiones bajos) es una palabra clave, ya que la especificación dice que está reservada para fines de implementación.
Konrad Borowski