Reading PDF content with the iTextSharp DLL in VB.NET or C#

Question 1

How can I read PDF content with iTextSharp using the PdfReader class? My PDF may include plain text or images of the text.

Answer 1

80

c# vb.net pdf itextsharp user221185

iTextSharp is now called "iText 7 for .NET", or "itext7-dotnet" on GitHub: link. Adding itext7 to your solution through NuGet is the recommended approach.

Peter Huber

Answer 3

184

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System.IO;
using System.Text; // StringBuilder and Encoding

public string ReadPdfFile(string fileName)
{
    StringBuilder text = new StringBuilder();

    if (File.Exists(fileName))
    {
        PdfReader pdfReader = new PdfReader(fileName);

        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

            currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    return text.ToString();
}

ShravankumarKumar

16

This should be marked as the solution! It works great for me.

Carter Medlin

1

Is there any particular reason pdfReader.Close(); happens inside the for loop?

Jueves 00 mÄ s

8

Why use .Close() at all, rather than using (var pdfReader = ...) {}?

Sebastián

2

Also, ASCIIEncoding.Convert should be Encoding.Convert, since it is a static method.

Sebastián

If anyone needs code similar to the above, with a step-by-step implementation for reading text from a PDF in C#, here is the link: qawithexperts.com/article/c-sharp/… thanks

user3559462
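Taking the comments above together, a revised version of the method might look like the sketch below. It assumes an iTextSharp 5.x build in which PdfReader implements IDisposable, so that a using block can replace the explicit Close() call. It also drops the re-encoding step, on the assumption that GetTextFromPage already returns an ordinary .NET string and the round trip through Encoding.Default adds nothing. The class name PdfTextReader is made up for illustration.

```csharp
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextReader
{
    // Returns the text of every page, or an empty string if the file is missing.
    public static string ReadPdfFile(string fileName)
    {
        var text = new StringBuilder();
        if (!File.Exists(fileName))
            return text.ToString();

        // Assumption: PdfReader is IDisposable in the iTextSharp build in use.
        using (var pdfReader = new PdfReader(fileName))
        {
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // A fresh strategy per page, as in the original answer.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                text.Append(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
            }
        }
        return text.ToString();
    }
}
```

Like the original, this only returns text that is stored as text in the PDF; text that exists only inside images is not extracted.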

Answer 9

iTextSharp 4.x (LGPL / FOSS)

var pdfReader = new PdfReader(path); // or a FileStream, etc.
byte[] pageContent = pdfReader.GetPageContent(pageNum); // page numbers are not zero-based
byte[] utf8 = Encoding.Convert(Encoding.Default, Encoding.UTF8, pageContent);
string textFromPage = Encoding.UTF8.GetString(utf8);

None of the other answers were useful to me; they all seem to target the AGPL v5 of iTextSharp. I could never find any reference to SimpleTextExtractionStrategy or LocationTextExtractionStrategy in the FOSS version.

Something else that might be very useful in conjunction with this:

const string PdfTableFormat = @"\(.*\)Tj";
Regex PdfTableRegex = new Regex(PdfTableFormat, RegexOptions.Compiled);

List<string> ExtractPdfContent(string rawPdfContent)
{
    var matches = PdfTableRegex.Matches(rawPdfContent);

    var list = matches.Cast<Match>()
        .Select(m => m.Value
            .Substring(1) //remove leading (
            .Remove(m.Value.Length - 4) //remove trailing )Tj
            .Replace(@"\)", ")") //unencode parens
            .Replace(@"\(", "(")
            .Trim()
        )
        .ToList();
    return list;
}

This will extract the text-only data from the PDF. If the displayed text is Foo(bar), it will be encoded in the PDF as (Foo\(bar\))Tj, and this method returns Foo(bar) as expected. The method strips out a lot of additional information, such as location coordinates, from the raw PDF content.
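One thing to watch with the pattern above: .* is greedy, so when two Tj operators sit on the same line of the content stream, the match runs from the first ( to the last )Tj. A possible variant, sketched below with only the standard Regex class, treats a backslash as escaping the next character and stops at the first unescaped closing parenthesis. The class name, method name, and sample string are made up for illustration, and the same limits apply (uncompressed content, literal strings, Tj only).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TjTextDemo
{
    // One literal string operand followed by the Tj operator. Inside the
    // parentheses a backslash escapes the next character, so "\)" does not
    // end the string. Nested unescaped parentheses are not handled.
    static readonly Regex TjOperand = new Regex(
        @"\((?<text>(?:\\.|[^\\()])*)\)\s*Tj", RegexOptions.Compiled);

    public static List<string> ExtractTjStrings(string rawPdfContent)
    {
        var result = new List<string>();
        foreach (Match m in TjOperand.Matches(rawPdfContent))
        {
            // Undo the escaping of parentheses, as in the code above.
            string text = m.Groups["text"].Value
                .Replace(@"\(", "(")
                .Replace(@"\)", ")");
            result.Add(text);
        }
        return result;
    }

    static void Main()
    {
        // Hypothetical content-stream fragment with two strings on one line.
        string sample = @"BT /F1 12 Tf (Foo\(bar\))Tj 0 -14 Td (Baz)Tj ET";
        foreach (string s in ExtractTjStrings(sample))
            Console.WriteLine(s); // prints Foo(bar), then Baz
    }
}
```

With the greedy pattern, the same sample line would come back as a single match covering both strings and the operators between them.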

Answer 10

1

You're right: before 5.x.x, text extraction was present in iText merely as a proof of concept, and not at all in iTextSharp. That said, the code you present only works on very primitively built PDFs (using fonts with an ASCII-ish encoding and Tj as the only text-drawing operator). It may be usable in very controlled environments (where you can make sure you only get such primitive PDFs), but not in general.

mkl

Answer 11

The correct regex is: (?<=\()(.*?)(?=\)Tj)

Diego

Answer 12

6

Here is a VB.NET solution based on ShravankumarKumar's solution.

This will ONLY give you the text. Images are a different story.

Public Shared Function GetTextFromPDF(PdfFileName As String) As String
    Dim oReader As New iTextSharp.text.pdf.PdfReader(PdfFileName)

    Dim sOut = ""

    For i = 1 To oReader.NumberOfPages
        Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy

        sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its)
    Next

    Return sOut
End Function

Carter Medlin

When I try this on my PDF, I get the error message "Value cannot be null. Parameter name: value". Any idea what this is about?

Avi

sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its). Also, I found out something about this error. If I take that call out of the loop and parse the pages individually, it works on one page and not on the other. The only difference I can see between the two is that the problematic page has images (which I don't need).

Avi

If you want to see the PDF, I can send it to you.

Avi

I'm using .NET 4.0 and iTextSharp 5.1.2.0 (just downloaded). Is it the same for you?

Carter Medlin

.NET 3.5 and iTextSharp 5.1.1. I'll update and see whether that resolves it.

Avi

Answer 18

In my case, I only wanted the text from a specific area of the PDF document, so I used a rectangle around the area and extracted the text from it. In the sample below, the coordinates cover the entire page. I don't have PDF authoring tools, so when it came time to narrow the rectangle down to the specific location, I took a few guesses at the coordinates until the area was found.

Rectangle _pdfRect = new Rectangle(0f, 0f, 612f, 792f); // Entire page - PDF coordinate system 0,0 is bottom left corner.  72 points / inch
RenderFilter _renderFilter = new RegionTextRenderFilter(_pdfRect);
ITextExtractionStrategy _strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), _renderFilter);
string _text = PdfTextExtractor.GetTextFromPage(_pdfReader, 1, _strategy);

As noted in the comments above, the resulting text does not keep any of the formatting found in the PDF document; however, I was happy that it did preserve the carriage returns. In my case, there were enough constants in the text that I was able to extract the values I needed.
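Rather than guessing, the rectangle can be worked out from a ruler measurement using the two facts in the code comment above: 72 points per inch, and an origin at the bottom-left corner. The sketch below is only arithmetic and does not touch iTextSharp. The helper name ToPdfPoints and the sample measurements are made up for illustration, and the output order (llx, lly, urx, ury) is assumed to match the Rectangle constructor call shown above.

```csharp
using System;

static class PdfRegionDemo
{
    const float PointsPerInch = 72f;

    // Converts a box measured in inches from the TOP-left corner of the page
    // into points measured from the BOTTOM-left corner.
    // Returns { llx, lly, urx, ury }.
    public static float[] ToPdfPoints(
        float leftIn, float topIn, float widthIn, float heightIn, float pageHeightPt)
    {
        float llx = leftIn * PointsPerInch;
        float urx = (leftIn + widthIn) * PointsPerInch;
        float ury = pageHeightPt - topIn * PointsPerInch;
        float lly = ury - heightIn * PointsPerInch;
        return new[] { llx, lly, urx, ury };
    }

    static void Main()
    {
        // A 2 x 0.5 inch box, 1 inch from the left and top of a 612 x 792 page.
        float[] r = ToPdfPoints(1f, 1f, 2f, 0.5f, 792f);
        Console.WriteLine(string.Join(", ", r)); // 72, 684, 216, 720
    }
}
```

The four values would then be passed to the Rectangle constructor in place of 0f, 0f, 612f, 792f. Pages with a different size or a rotated layout need their own page height.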

Answer 19

Here is an improved version of ShravankumarKumar's answer. I created special classes for the pages, so you can access the words in the PDF by text row and by the word's position in that row.

using System.Collections.Generic; // List<T>
using System.Linq; // Select, ToList, Where, Any
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

//create a list of pdf pages
var pages = new List<PdfPage>();

//load the pdf into the reader. NOTE: path can also be replaced with a byte array
using (PdfReader reader = new PdfReader(path))
{
    //loop all the pages and extract the text
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        pages.Add(new PdfPage()
        {
           content = PdfTextExtractor.GetTextFromPage(reader, i)
        });
    }
}

//use linq to create the rows and words by splitting on newline and space
pages.ForEach(x => x.rows = x.content.Split('\n').Select(y => 
    new PdfRow() { 
       content = y,
       words = y.Split(' ').ToList()
    }
).ToList());

The custom classes

class PdfPage
{
    public string content { get; set; }
    public List<PdfRow> rows { get; set; }
}


class PdfRow
{
    public string content { get; set; }
    public List<string> words { get; set; }
}

Now you can get a word by row index and word index.

string myWord = pages[0].rows[12].words[4];

Or use LINQ to find the rows that contain a specific word.

//find the rows in a specific page containing a word
var myRows = pages[0].rows.Where(x => x.words.Any(y => y == "myWord1")).ToList();

//find the rows in all pages containing a word
var myRowsAllPages = pages.SelectMany(r => r.rows).Where(x => x.words.Any(y => y == "myWord2")).ToList();
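One caveat with splitting on a single '\n' and a single ' ': a line ending in "\r\n" leaves a trailing '\r' on the last word, and two spaces in a row produce an empty word that shifts the indexes. A possible variant, sketched below with only the standard library, drops empty entries and accepts both line endings. The class name, method name, and sample text are made up for illustration. Because empty rows are dropped, row indexes can differ from those produced by the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class RowSplitDemo
{
    // Splits extracted page text into rows, and each row into words.
    // Empty rows and empty words (from repeated spaces) are dropped.
    public static List<List<string>> ToRows(string pageText)
    {
        return pageText
            .Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(row => row
                .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
                .ToList())
            .ToList();
    }

    static void Main()
    {
        // Hypothetical extracted text: a CRLF line ending and a double space.
        string pageText = "Invoice  No 42\r\nTotal 99.50\n";
        List<List<string>> rows = ToRows(pageText);
        Console.WriteLine(rows[0][2]);                              // 42
        Console.WriteLine(rows.Count(r => r.Contains("Total")));   // 1
    }
}
```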

Answer 20

-1

Public Sub PDFTxtToPdf(ByVal sTxtfile As String, ByVal sPDFSourcefile As String)
    Dim sr As StreamReader = New StreamReader(sTxtfile)
    Dim doc As New Document()
    PdfWriter.GetInstance(doc, New FileStream(sPDFSourcefile, FileMode.Create))
    doc.Open()
    doc.Add(New Paragraph(sr.ReadToEnd()))
    doc.Close()
End Sub

Raja

1

The question asks how to read a PDF file; your answer creates one!

AaA
