How to crawl Kakao Story using R

2015, Jan 06    

Kakao Story is a social networking service provided by DaumKakao, and it is popular with many young Koreans. It has its own dev centre and open APIs, but as I'm not skilled enough to use them, I used R and a saved HTML file to crawl data from it. The approach is really simple and primitive, so if you are already familiar with crawling data from Kakao Story, this post won't be much help.

The basic idea is to download an HTML file and parse it with R. It's really important to look closely at the HTML tag structure and check where your data lies.

Here's how I did it.
1. Prepare the file
Go to the page you want to crawl, copy the div that holds the data you want, and save it as an HTML file.

2. Read the file in RStudio.
> library("XML") # load the XML package to use "htmlParse"
> doc <- readLines("html_file.html")
You'll see a warning message like "incomplete final line found..." but it's okay.

3. Parse the doc. If the content is in Korean, set the encoding to UTF-8.
> doc_2 <- htmlParse(doc, encoding="UTF-8")

4. Find the class or id names of the divs that you want to crawl.
For example, the like count is in a div with the class “_likeCount”.
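
Before extracting anything, it can help to confirm that the class name really matches elements in the parsed document. The sketch below is only an illustration: the inline HTML string and the "post" wrapper class are made up to stand in for the saved file, and it uses an XPath query from the XML package rather than the CSS package used in the next step.

```r
library(XML)

# Hypothetical stand-in for the saved file: two posts, each with a like count.
html <- '<div class="post"><div class="_likeCount">12</div></div>
<div class="post"><div class="_likeCount">7</div></div>'

doc_2 <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# XPath counterpart of the CSS selector "._likeCount".
# Note: this only matches elements whose class attribute is exactly "_likeCount".
nodes <- getNodeSet(doc_2, "//div[@class='_likeCount']")

length(nodes)             # number of matching divs
sapply(nodes, xmlValue)   # text inside each matching div
```

If the length comes back as 0, the class name is probably misspelled or the div you saved does not contain it.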

5. Crawl using the CSS library.
> install.packages("CSS")
> library("CSS")
> likeCounts <- cssApply(doc_2, "._likeCount", cssCharacter)
You have to put a period (".") before the class name. If what you are crawling is a string of characters, the third parameter is "cssCharacter". See the CSS package documentation for the other extractor functions. This CSS library is awesome, to say the least.
Crawl the other info with the same method, for example:
> urls <- cssApply(doc_2, ".player>a", cssLink)

6. If the data you're crawling is missing in some posts, it's better to use cssApplyInNodeSet, which puts NAs in the gaps so that the result lines up with the other data.
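
To show what NA padding means here, the following sketch does the same kind of thing by hand with the XML package only; it is not cssApplyInNodeSet itself. The inline HTML, the "post" and "_content" class names, and the helper first_or_na are all hypothetical. The idea is to select each post first and then look inside it, so a post without a like count yields NA instead of silently dropping out.

```r
library(XML)

# Hypothetical sample: the second post has no like count.
html <- '<div class="post"><div class="_content">first</div><div class="_likeCount">12</div></div>
<div class="post"><div class="_content">second</div></div>
<div class="post"><div class="_content">third</div><div class="_likeCount">3</div></div>'

doc_2 <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# One node per post.
posts <- getNodeSet(doc_2, "//div[@class='post']")

# Hypothetical helper: text of the first match inside a node, or NA if none.
first_or_na <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# The leading "." makes each XPath relative to the current post.
contents   <- sapply(posts, first_or_na, path = ".//div[@class='_content']")
likeCounts <- sapply(posts, first_or_na, path = ".//div[@class='_likeCount']")

likeCounts
```

Both vectors end up with one entry per post, so they can be combined column by column in the next step.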

7. Use the cbind command to make a matrix.
> my_matrix <- cbind(dates, likeCounts, contents)
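
As a small illustration with made-up values, here is what cbind produces from three character vectors, plus a data.frame variant in case you want the like counts stored as numbers. The sample data and the name my_df are assumptions for the example only.

```r
# Made-up vectors standing in for the crawled data.
dates      <- c("2015-01-02", "2015-01-04")
likeCounts <- c("12", "7")
contents   <- c("first post", "second post")

# cbind turns each vector into a column named after the variable.
my_matrix <- cbind(dates, likeCounts, contents)
dim(my_matrix)    # 2 3
mode(my_matrix)   # "character": a matrix holds a single type

# Optional variant: a data frame can keep likeCounts as integers.
my_df <- data.frame(dates,
                    likeCounts = as.integer(likeCounts),
                    contents,
                    stringsAsFactors = FALSE)
str(my_df)
```

cbind recycles shorter vectors to fit the longest one, so it is worth checking that all vectors have the same length first; that is why the NA padding in step 6 matters.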

8. Export to Excel
To do so, install the "xlsx" library and load it.
I tried the "write.csv" command first, but some encoding issues occurred and Excel couldn't load the file properly.
> install.packages("xlsx")
> library("xlsx")
> write.xlsx(my_matrix, "name_of_your_file.xlsx")

9. Go to Excel and open it!
Voila! It’s done!