Parsing Kakao Story using R

2015, Jan 13    


This blog post is a follow-up to my previous post about crawling data from Kakao Story. As I mentioned there, I extract the data from Kakao Story's HTML source rather than through its API. Using R's XML and CSS packages, I wrote a simple R script that extracts each post's published time, content, likes, comments, shares and URL.

How to use

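1. Set up the R environment. Download and install R on your PC or Mac, then install the XML and CSS packages. The script also calls write.xlsx, which neither of them provides; the xlsx package is assumed here.
> install.packages("XML")
> install.packages("CSS")
> install.packages("xlsx")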
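If you would rather not reinstall packages that are already there, a small check like the one below can help. It is only a sketch: the helper name missing_packages is made up for this post, and including xlsx in the list is my assumption about where write.xlsx comes from.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "xlsx" is an assumption; XML and CSS are the packages named in this post.
to_install <- missing_packages(c("XML", "CSS", "xlsx"))
if (length(to_install) > 0) {
  message("Still to install: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to actually install them
}
```

The install call is commented out so that running the snippet never touches your library by accident.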

2. Prepare the data. Go to Kakao Story and find the account you want to crawl. Scroll down until the page has loaded every post you want to collect. Right-click the page and select "inspect element" (in Chrome). For a demonstration, I chose SK Telecom's official Kakao Story account.

SKT’s official Kakao Story account

Scroll down until you get to the right time period.

I scrolled down to November 1, 2014, and opened "inspect element". You'll see many <div class="section _activity" ...> elements; each one is the container for a single post. Move up the tree until you find a div with the class "_listContainer". Right-click that div, copy it, paste it into an empty file in a text editor, and save it as an HTML file. This is what you'll get.

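Before running the parser, it can be worth checking that the saved file really contains the post containers. The sketch below uses only base R and simply counts how often the class string "section _activity" appears; the function name count_post_containers and the tiny demo file are made up for illustration.

```r
# Hypothetical helper: count the post containers in a saved HTML file.
count_post_containers <- function(path) {
  html <- paste(readLines(path, encoding = "UTF-8", warn = FALSE), collapse = "\n")
  hits <- gregexpr("class=\"section _activity", html, fixed = TRUE)[[1]]
  # gregexpr() reports -1 when there is no match at all
  if (hits[1] == -1) 0L else length(hits)
}

# Demo on a tiny stand-in file; real Kakao Story markup is far larger.
demo_file <- tempfile(fileext = ".html")
writeLines(c(
  '<div class="_listContainer">',
  '<div class="section _activity">first post</div>',
  '<div class="section _activity">second post</div>',
  '</div>'
), demo_file)
count_post_containers(demo_file)  # 2
```

On your own data you would call it with the file you saved, for example count_post_containers("skt.html"), and compare the number with what you saw on the page.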

3. Download the R script file (KSParser.R). You can download it from my GitHub page.

4. Place the script file in the same directory as the html file.

5. Open R or RStudio, and run the script.
> source("KSParser.R")

KSParser.R in RStudio

When prompted, type the name of the file without the extension; in my case that is 'skt', from skt.html.

6. After a few seconds, a new file is created with the prefix "done_". Open the file in Excel and check that everything looks okay.
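The file naming is easy to get wrong, so here is a small sketch of how the typed name maps to the input and output files. The helper ks_file_names is hypothetical and not part of KSParser.R; stripping an accidentally typed ".html" is an extra convenience I added.

```r
# Hypothetical helper: derive the input and output file names from the typed name.
ks_file_names <- function(name_of_file) {
  # Tolerate stray spaces and an accidentally typed ".html" extension.
  base <- sub("\\.html$", "", trimws(name_of_file), ignore.case = TRUE)
  list(
    input  = paste0(base, ".html"),
    output = paste0("done_", base, ".xlsx")
  )
}

ks_file_names("skt")
```

For 'skt' this gives skt.html as the input and done_skt.xlsx as the output.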

Voila! It’s done and dusted.

Extraction completed! All 130 posts and their attributes were extracted.

Here's my R script:


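# KSParser.R: extract post attributes from a saved Kakao Story HTML file.
library(XML)   # htmlParse, xmlRoot, xpathSApply, xmlValue, xmlGetAttr
library(CSS)   # cssApply, cssApplyInNodeSet, cssCharacter, cssNumeric
library(xlsx)  # write.xlsx (assumed source; the listing does not name the package)

# Ask for the file name without its extension, e.g. "skt" for skt.html.
name_of_file <- readline("type in the name of html file without .html: ")
file_name <- paste(name_of_file, ".html", sep="")

# Parse the saved HTML.
k_doc <- htmlParse(file_name, encoding="UTF-8")
k_root <- xmlRoot(k_doc)

# Published time and text of each post.
time <- xpathSApply(k_doc, "//a[@class='time _linkPost']", xmlValue)
content <- cssApplyInNodeSet(k_doc, ".fd_cont", ".txt_wrap", cssCharacter)

# Like, comment and share counters.
likes <- cssApply(k_doc, "._likeCount", cssNumeric)
comments <- cssApply(k_doc, "._commentCount", cssNumeric)
shares <- cssApply(k_doc, "._shareCount", cssNumeric)

# Link of each post, taken from the href of its timestamp anchor.
link <- xpathSApply(k_root, "//a[@class='time _linkPost']", xmlGetAttr, "href")
# The prefix is empty in this listing, so the href is used as-is.
link_pasted <- paste("", link, sep="")

# One row per post, written to done_<name>.xlsx.
final <- cbind(time, likes, comments, shares, link_pasted, content)
renamed_file <- paste("done", name_of_file, sep="_")
xlsx_file <- paste(renamed_file, ".xlsx", sep="")
write.xlsx(final, xlsx_file)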
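One thing to watch for: the script collects each attribute as a separate vector, and cbind recycles shorter vectors, so a post without, say, a like counter could shift values onto the wrong rows. The sketch below shows one possible alternative, not what KSParser.R does: it walks the post containers one by one and fills in NA for anything missing. The sample markup is a made-up, heavily simplified stand-in for the real page, and first_or_na is a hypothetical helper.

```r
library(XML)

# Hypothetical, simplified markup; the second post has no like counter.
sample_html <- '
<div class="_listContainer">
  <div class="section _activity">
    <a class="time _linkPost" href="/skt/post1">Nov 1, 2014</a>
    <span class="_likeCount">12</span>
  </div>
  <div class="section _activity">
    <a class="time _linkPost" href="/skt/post2">Nov 2, 2014</a>
  </div>
</div>'

doc <- htmlParse(sample_html, asText = TRUE, encoding = "UTF-8")
posts <- getNodeSet(doc, "//div[@class='section _activity']")

# Return the first match inside one post, or NA when there is none.
first_or_na <- function(node, xpath, fun, ...) {
  hits <- xpathSApply(node, xpath, fun, ...)
  if (length(hits) == 0) NA_character_ else as.character(hits[[1]])
}

# Build one data frame row per post, so every value stays with its post.
rows <- lapply(posts, function(post) {
  data.frame(
    time  = first_or_na(post, ".//a[@class='time _linkPost']", xmlValue),
    link  = first_or_na(post, ".//a[@class='time _linkPost']", xmlGetAttr, "href"),
    likes = as.numeric(first_or_na(post, ".//*[contains(@class, '_likeCount')]", xmlValue)),
    stringsAsFactors = FALSE
  )
})
result <- do.call(rbind, rows)
result
```

The leading dot in each XPath (".//") keeps the search inside the current post; without it the query would run over the whole document again.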