Parsing Kakao Story using R
This blog post is a follow-up to my previous post about crawling data from Kakao Story. As I mentioned there, I extracted the data from Kakao Story's HTML pages rather than through its API. Using R's XML and CSS packages, I wrote this simple R script, which extracts each post's published time, content, likes, comments, shares, and URLs.
How to use
1. Set up the R environment. Download and install R on your PC or Mac, then install the XML and CSS packages.
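The setup step looks like this. (The CSS package has at times only been available from CRAN's archive, so the install line is illustrative; the XML package installs normally.)

```r
# One-time setup: install the two packages the parser depends on.
install.packages("XML")
install.packages("CSS")   # CSS-selector helpers layered on top of XML

# Load them in every session that runs the parser.
library(XML)
library(CSS)
```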
2. Ready the data. Go to Kakao Story and find the account you want to crawl. Scroll down until you reach the point you want to crawl back to. Right-click the page and select "Inspect Element" (on Chrome). For a demonstration, I chose SK Telecom's official Kakao Story account.
I scrolled down to the first of November, 2014, and hit "Inspect Element". You'll see loads of "<div class="section _activity"...>" elements; each one is the container for a single post. Go up until you find a div with the class "_listContainer". Right-click it, copy the div, and paste it into an empty text editor. Save it as an HTML file. This is what you'll get.
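To make the structure concrete, here is a small self-contained sketch of how the XML package walks markup like this. The inline HTML stands in for the saved file, but the class names match what you'll see in the inspector; in practice you'd call htmlParse on your saved .html file instead.

```r
library(XML)

# A toy stand-in for the saved "_listContainer" div.
html <- '<div class="_listContainer">
  <div class="section _activity"><p>first post</p></div>
  <div class="section _activity"><p>second post</p></div>
</div>'

# Parse the fragment; for a real run: htmlParse("skt.html", encoding = "UTF-8").
doc <- htmlParse(html, asText = TRUE)

# Each post lives in a <div class="section _activity" ...> node.
posts <- getNodeSet(doc, "//div[contains(@class, '_activity')]")
length(posts)          # number of posts captured
xmlValue(posts[[1]])   # text content of the first post
```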
3. Download the R script file (KSParser.R). You can get it from my GitHub page.
4. Place the script file in the same directory as the html file.
5. Open R or RStudio, and run the script.
Type in the name of the file, which in my case is 'skt' from skt.html.
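The run itself is roughly this; the readline prompt below is one illustrative way such a filename prompt can be implemented, not a verbatim excerpt from KSParser.R:

```r
# From the directory containing both KSParser.R and skt.html:
source("KSParser.R")

# A sketch of the kind of prompt the script uses (names are illustrative):
fname <- readline("Enter file name: ")               # e.g. type: skt
doc   <- XML::htmlParse(paste0(fname, ".html"), encoding = "UTF-8")
```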
6. In a few seconds, a new file will be created with the prefix "done_". Open the file in Excel and check that everything looks okay.
Extraction completed! All 130 posts and their attributes were downloaded.
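Assuming the script writes its table out as CSV (the exact output format is whatever KSParser.R produces), you can also sanity-check the result back in R instead of Excel:

```r
# Read the generated file back in; the "done_" prefix comes from the script.
result <- read.csv("done_skt.csv", stringsAsFactors = FALSE)

nrow(result)   # should match the number of posts (130 here)
head(result)   # published time, content, likes, comments, shares, URL
```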
Here's my R script code: