Extract the characters between the first and second underscore of file names and count those files in a Linux folder

I would like to extract the characters between the first and second underscore of the file names in a folder, and count how many files of each kind are present. The folder contains files named in a particular format, such as:

2305195303310_ABC_A08_1378408840043.hl7

2305195303310_ABC_A08_1378408840043.hl7
Q37984932T467566261_DEF_R03_1378825633215.hl7
37982442T467537201_DEF_R03_1378823455384.hl7
37982442T467537201_MNO_R03_1378823455384.hl7
2305195303310_ABC_A08_1378408840053.hl7
Q37984932T467566261_DEF_R03_1378825633215.hl7
37982442T467537201_MNO_R03_1378823455384.hl7

and so on.

The script should give me output like:

ABC 3
DEF 3
MNO 2

shell python perl bash-scripting user1679829

You can do it the classic *nix way, by chaining small commands together. First, find the files of interest; for this you can use shell globbing:

for i in *_*_*; do echo "$i"; done

That command prints every file in the current directory whose name contains at least two underscores. To extract the string between those underscores, you can use cut, telling it to use _ as the field delimiter and to print the second field:

cut -d '_' -f 2
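As a quick illustration of what that cut invocation does to a single name, here is a minimal sketch. The function name second_field is hypothetical and exists only for this example; the sample file name is taken from the question.

```shell
# second_field is a hypothetical helper name used only for illustration.
# It prints whatever lies between the first and second underscore of its argument.
second_field() {
    printf '%s\n' "$1" | cut -d '_' -f 2
}

second_field '2305195303310_ABC_A08_1378408840043.hl7'   # prints ABC
```

For a name such as foo__bar, the second field is empty, so the function prints an empty line.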

Piping the first command through the second prints the strings you are interested in, but it also prints an empty line for any name that has no characters between the underscores (foo__bar, for example). You can filter those out with grep ., which prints only lines that contain at least one character (whitespace included). Finally, you can count the results by passing the output through sort and uniq -c.

Putting it all together gives you:

$ for i in *_*_*; do echo "$i" | cut -d '_' -f 2 ; done | 
   grep . | sort | uniq -c

  3 ABC
  2 DEF
  1 MNO

If you really want the number on the other side, you can use awk:

$ for i in *_*_*; do echo "$i" | cut -d '_' -f 2 ; done | 
   grep . | sort | uniq -c | awk '{print $2,$1}'

ABC 3
DEF 2
MNO 1
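One possible variation on the loop above, offered as a sketch rather than as part of the answer's method, replaces the per-file echo and cut with POSIX parameter expansion, so that only one sort, one uniq and one awk are started regardless of how many files there are. The function name count_second_fields is hypothetical, and the sample names are shortened versions of those in the question.

```shell
# count_second_fields is a hypothetical name; it takes file names as arguments.
count_second_fields() {
    for name in "$@"; do
        # Skip anything that does not contain at least two underscores.
        case $name in
            *_*_*) ;;
            *) continue ;;
        esac
        rest=${name#*_}     # drop everything up to and including the first underscore
        key=${rest%%_*}     # keep only what precedes the next underscore
        # Ignore names such as foo__bar, where the field is empty.
        if [ -n "$key" ]; then
            printf '%s\n' "$key"
        fi
    done | sort | uniq -c | awk '{print $2, $1}'
}

# In a real directory this would be called as: count_second_fields *_*_*
count_second_fields 1_ABC_A08_1.hl7 2_DEF_R03_2.hl7 3_DEF_R03_3.hl7
```

With the three sample names shown, this prints "ABC 1" followed by "DEF 2". Note that if no file matches the glob, the shell passes the pattern through literally, so a real script may want to check for that case first.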

terdon

No need to call echo for everything; just do ls *_*_* | ...; instead of grep and sort and uniq, count lines with something in an associative array in awk directly: ls *_*_* | cut -d_ -f2 | awk '/./ { count[$1]++; } END {for (f in count) { print f, count[f]; } }' (note: you can do the cut part in awk as well, of course, but for a comment field that's too awkward (pun unavoidable)).

Gabe

@Gabe that is a very bad idea: parsing ls should always be avoided, and it breaks on many things, first and foremost names containing spaces. Apart from that, using coreutils is i) faster, ii) more portable than implementing it in gawk (and ls is notoriously non-portable between systems and locales), and iii) more faithful to the *nix way. Of course you can do it with a script, but why, when you have compiled executables that can do it for you?

terdon

I disagree. Or rather, I agree that dealing with filenames is something best not done in the shell at all - given that there are exactly two byte values (NUL and slash) that aren't potentially parts of file names, the shell utilities simply aren't equipped to deal with all of the weird cases (newlines will break both our suggestions). That said, /bin/ls -1 *_*_* will be no worse than the for loop (which will fork/exec a pipeline for every single file, uselessly, instead of once). Your portability comment is a red herring: that awk code is portable across POSIX.

Gabe

@Gabe for the awk comment I was thinking more about busybox and embedded systems that are very likely not to have awk. I'm willing to bet that the for loop will be faster than implementing a gawk solution but I'll have to test that :). You're right about ls -1 being equivalent, I was thinking of ls alone.

terdon

No contest. time bash -c 'for i in a_b_c d_e_f aa_b_c dd_e_f ; do echo "$i" | cut -d_ -f2 ; done | grep . | sort | uniq -c' vs

time bash -c 'printf "%s\\n" a_b_c d_e_f aa_b_c dd_e_f | cut -d_ -f2 | awk '\''/./ { count[$1]++; } END {for (f in count) { print f, count[f]; } }'\'

- more files will just make the iteration slower. On the workstation, the loop takes 15ms vs 8 for a single pipe; macbook takes 24 vs 12.

Gabe
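The two positions in this thread can be combined: let the shell glob supply the names (so ls is never parsed) and let a single awk process do both the splitting and the counting. The following is a minimal sketch under those assumptions; count_with_awk is a hypothetical name, and like every line-based approach discussed above it would still miscount file names that contain newlines.

```shell
# count_with_awk is a hypothetical name; it takes file names as arguments.
count_with_awk() {
    printf '%s\n' "$@" |
        awk -F_ '
            # Count only names with at least two underscores and a non-empty second field.
            NF >= 3 && $2 != "" { count[$2]++ }
            END { for (key in count) print key, count[key] }
        ' |
        sort    # awk does not guarantee an iteration order, so sort the result
}

# In a real directory this would be called as: count_with_awk *_*_*
count_with_awk 1_ABC_A08_1.hl7 2_DEF_R03_2.hl7 3_DEF_R03_3.hl7
```

With the three sample names shown, this prints "ABC 1" followed by "DEF 2".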
