Scraping html tables into R data frames using the XML package

153

How do I scrape html tables using the XML package?

Take, for example, this wikipedia page on the Brazilian soccer team. I would like to read it into R and get the "list of all matches Brazil have played against FIFA recognised teams" table as a data frame. How can I do this?

html r xml parsing web-scraping Eduardo Leoni

11

To work out XPath selectors, take a look at selectorgadget.com/ - it is awesome

Hadley

144

... or a shorter attempt:

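library(XML)
library(RCurl)
library(rlist)

# Fetch the page source as a string (certificate verification is disabled here)
theurl <- getURL("https://en.wikipedia.org/wiki/Brazil_national_football_team",
                 .opts = list(ssl.verifypeer = FALSE))
# Parse every <table> on the page into a list of data frames
tables <- readHTMLTable(theurl)
# Drop the NULL entries left by tables that could not be parsed
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
# Count the rows of each remaining table
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))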

the table picked is the longest one on the page

tables[[which.max(n.rows)]]

Jim G.
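
The pick-the-longest-table idea can be tried without a network connection. This is a minimal sketch, assuming a small inline HTML string (hypothetical, standing in for the Wikipedia page) and using base R's Filter() in place of rlist::list.clean():

```r
library(XML)

# Hypothetical two-table page; the real page has many more tables
html <- "<html><body>
<table><tr><th>a</th></tr><tr><td>1</td></tr></table>
<table>
<tr><th>Opponent</th><th>Played</th></tr>
<tr><td>Argentina</td><td>94</td></tr>
<tr><td>Paraguay</td><td>72</td></tr>
</table>
</body></html>"

# Parse the string as HTML, then read every table into a data frame
doc <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

# Drop NULL entries, then keep the table with the most rows
tables <- Filter(Negate(is.null), tables)
n_rows <- vapply(tables, nrow, integer(1))
longest <- tables[[which.max(n_rows)]]
```

Picking by row count is only a heuristic; it works here because the table of interest happens to be the longest one.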

The readHTMLTable help page also provides an example of reading a plain-text table out of an HTML PRE element using htmlParse(), getNodeSet(), textConnection() and read.table()

Dave X
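
The approach mentioned in this comment might look roughly like the sketch below. It is not the example from the help page; the inline HTML and its column names are made up for illustration.

```r
library(XML)

# Hypothetical page whose table is plain text inside a <pre> element
html <- "<html><body><pre>
team played won
Argentina 94 36
Paraguay 72 44
</pre></body></html>"

doc <- htmlParse(html, asText = TRUE)

# Locate the <pre> node and pull out its text content
pre_nodes <- getNodeSet(doc, "//pre")
txt <- xmlValue(pre_nodes[[1]])

# Read the whitespace-separated text as if it were a file
con <- textConnection(txt)
pre_table <- read.table(con, header = TRUE, stringsAsFactors = FALSE)
close(con)
```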

48

library(RCurl)
library(XML)

# Download page using RCurl
# You may need to set proxy details, etc.,  in the call to getURL
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
# Process escape characters
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

# Parse the html tree, ignoring errors on the page
pagetree <- htmlTreeParse(webpage, error=function(...){})

# Navigate your way through the tree. It may be possible to do this more efficiently using getNodeSet
body <- pagetree$children$html$children$body 
divbodyContent <- body$children$div$children[[1]]$children$div$children[[4]]
tables <- divbodyContent$children[names(divbodyContent)=="table"]

#In this case, the required table is the only one with class "wikitable sortable"  
tableclasses <- sapply(tables, function(x) x$attributes["class"])
thetable  <- tables[which(tableclasses=="wikitable sortable")]$table

#Get columns headers
headers <- thetable$children[[1]]$children
columnnames <- unname(sapply(headers, function(x) x$children$text$value))

# Get rows from table
content <- c()
for(i in 2:length(thetable$children))
{
   tablerow <- thetable$children[[i]]$children
   opponent <- tablerow[[1]]$children[[2]]$children$text$value
   others <- unname(sapply(tablerow[-1], function(x) x$children$text$value)) 
   content <- rbind(content, c(opponent, others))
}

# Convert to data frame
colnames(content) <- columnnames
as.data.frame(content)

Edited to add:

Sample output

                     Opponent Played Won Drawn Lost Goals for Goals against  % Won
    1               Argentina     94  36    24   34       148           150  38.3%
    2                Paraguay     72  44    17   11       160            61  61.1%
    3                 Uruguay     72  33    19   20       127            93  45.8%
    ...

Richie Cotton

77

For anyone else lucky enough to find this post, this script will likely not run unless the user adds their "User-Agent" information, as described in this other helpful post: stackoverflow.com/questions/9056705/…

Rguy

26

Another option using XPath.

library(RCurl)
library(XML)

theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
webpage <- getURL(theurl)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)

# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/th", xmlValue)
results <- xpathSApply(pagetree, "//*/table[@class='wikitable sortable']/tr/td", xmlValue)

# Convert character vector to dataframe
content <- as.data.frame(matrix(results, ncol = 8, byrow = TRUE))

# Clean up the results
content[,1] <- gsub("Â ", "", content[,1])
tablehead <- gsub("Â ", "", tablehead)
names(content) <- tablehead

Produces this result

> head(content)
   Opponent Played Won Drawn Lost Goals for Goals against % Won
1 Argentina     94  36    24   34       148           150 38.3%
2  Paraguay     72  44    17   11       160            61 61.1%
3   Uruguay     72  33    19   20       127            93 45.8%
4     Chile     64  45    12    7       147            53 70.3%
5      Peru     39  27     9    3        83            27 69.2%
6    Mexico     36  21     6    9        69            34 58.3%

learnr
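
A data frame built from a character matrix holds every column as text. Below is a small sketch of converting the numeric columns, assuming a three-row excerpt of the output above retyped by hand, with only three of the eight columns kept for brevity:

```r
# Hand-typed excerpt of the scraped result; all columns start as character
content <- data.frame(
  Opponent = c("Argentina", "Paraguay", "Uruguay"),
  Played   = c("94", "72", "72"),
  `% Won`  = c("38.3%", "61.1%", "45.8%"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Counts become integers
content$Played <- as.integer(content$Played)

# Percentages lose the trailing "%" and become numeric
content[["% Won"]] <- as.numeric(sub("%", "", content[["% Won"]], fixed = TRUE))

str(content)
```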

Excellent call on using xpath. Minor point: you can slightly simplify the path argument by changing //*/ to //, e.g., "//table[@class='wikitable sortable']/tr/th"

Richie Cotton
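
The simplification suggested in this comment can be checked on a small document. This sketch assumes a hypothetical inline table with the same class attribute as the one on the Wikipedia page:

```r
library(XML)

# Hypothetical one-table page using the class from the answer
html <- "<html><body><div>
<table class='wikitable sortable'>
<tr><th>Opponent</th><th>Played</th></tr>
<tr><td>Argentina</td><td>94</td></tr>
</table>
</div></body></html>"

doc <- htmlParse(html, asText = TRUE)

# Original path from the answer and the shorter form from the comment
long_form  <- xpathSApply(doc, "//*/table[@class='wikitable sortable']/tr/th", xmlValue)
short_form <- xpathSApply(doc, "//table[@class='wikitable sortable']/tr/th", xmlValue)

identical(long_form, short_form)
```

Both paths select the same header cells here because every table element in an HTML document has a parent element.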

I get an error: "Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice." Is there any way around this?

2

options(RCurlOptions = list(useragent = "zzzz")). See also the "Runtime" section of omegahat.org/RCurl/FAQ.html for other alternatives and discussion.

learnr
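
Spelled out, the suggestion could look like the sketch below. The User-Agent string and contact address are placeholders, not values from the thread, and the download itself is left commented out so the snippet runs offline.

```r
library(RCurl)

# Placeholder identification string; replace with your own tool name and contact
ua <- "my-table-scraper/0.1 (contact: you@example.com)"

# Session-wide default picked up by later RCurl requests
options(RCurlOptions = list(useragent = ua))

# Later calls then send the header without further arguments, e.g.:
# webpage <- getURL("http://en.wikipedia.org/wiki/Brazil_national_football_team")

# The same curl option can also be passed for a single call:
# webpage <- getURL(theurl, useragent = ua)
```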

25

rvest, together with xml2, is another popular package for parsing html web pages.

library(rvest)
theurl <- "http://en.wikipedia.org/wiki/Brazil_national_football_team"
file <- read_html(theurl)
tables <- html_nodes(file, "table")
table1 <- html_table(tables[4], fill = TRUE)

The syntax is easier to use than the XML package's, and for most web pages the package provides all of the options one needs.

Dave2e
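
The rvest calls can be tried without fetching anything. This is a minimal sketch, assuming a hypothetical inline table and selecting it by CSS class instead of by position:

```r
library(rvest)

# Hypothetical stand-in for the page; read_html() also accepts literal HTML
html <- "<html><body>
<table class='wikitable sortable'>
<tr><th>Opponent</th><th>Played</th></tr>
<tr><td>Argentina</td><td>94</td></tr>
<tr><td>Paraguay</td><td>72</td></tr>
</table>
</body></html>"

page <- read_html(html)

# Select the first table carrying the "wikitable" class and convert it
tbl <- html_table(html_node(page, "table.wikitable"))
tbl
```

Selecting by class rather than by index such as tables[4] may hold up better when the page gains or loses tables, though that depends on the class staying stable.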

read_html gives me the error "'file:///Users/grieb/Auswertungen/tetyana-snp-2016/data/snp-nexus/15/SNP%20Annotation%20Tool.html' does not exist in current working directory ('/Users/grieb/Auswertungen/tetyana-snp-2016/code')".

scs
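
A possible workaround, assuming the problem is that the file:/// URL with its %20 escapes is being looked up as a literal path: strip the scheme and decode the escapes before calling read_html(). The temporary file below is hypothetical and only mimics a file name with spaces.

```r
library(rvest)

# Create a small HTML file whose name contains spaces (illustration only)
path <- file.path(tempdir(), "SNP Annotation Tool.html")
writeLines(
  "<html><body><table><tr><th>id</th></tr><tr><td>1</td></tr></table></body></html>",
  path
)

# A browser-style URL for that file, with spaces percent-encoded
file_url <- paste0("file://", utils::URLencode(path))

# Turn the URL back into a plain filesystem path
local_path <- utils::URLdecode(sub("^file://", "", file_url))

page <- read_html(local_path)
tbl <- html_table(html_node(page, "table"))
```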
