Parsing Instagram using R

2015, Jan 17    


My previous blog post was about how to extract data from Kakao Story. I manipulated its R code a little bit to parse HTML files from Instagram.

How to use

1. Set up the right R environment. Go and download R on your PC or Mac, and install the XML and CSS packages.
> install.packages("XML")
> install.packages("CSS")
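The script at the end of this post also calls write.xlsx(), which lives in the xlsx package, so it's worth installing that too. A small base-R sketch (package names taken from the script) that reports which of the three packages still need installing:

```r
# packages the parser needs: XML and CSS for parsing, xlsx for the output
pkgs <- c("XML", "CSS", "xlsx")

# check which ones are not installed yet, without loading them
missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(missing) > 0) {
  message("run install.packages() for: ", paste(missing, collapse = ", "))
}
```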

2. Prepare the data. This bit is pretty much the same as for Kakao Story. Go to Instagram and find the account you want to crawl. Scroll down until you reach the extent you want to crawl. Right-click the screen and select "Inspect Element" (on Chrome). For a demonstration, I picked Cara Delevingne's Instagram, @caradelevingne.

@caradelevingne's account: Cara Delevingne, a famous model.

I scrolled down to the first of December 2014, and hit "Inspect Element". You'll see loads of `div data-reactid="..."` elements; each one is a container for a post. Go up until you find a div with a class called "PhotoGrid". Right-click to copy that div, paste it into an empty text editor, and save it as an HTML file. This is what you'll get.

@caradelevingne's account: scroll down until you get to the right time period.
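To see what the parser will be looking for inside that saved file, here is a toy stand-in for the copied div, parsed with the XML package. The class names come from the script below, but the markup itself is simplified and made up:

```r
library(XML)

# a minimal, made-up stand-in for the copied "PhotoGrid" div
# (the real markup is much more deeply nested)
snippet <- '<div class="PhotoGrid">
  <div class="PhotoGridMediaItem" aria-label="1200 likes, 34 comments">
    <a class="pgmiImageLink" href="/p/abc123/"></a>
    <span class="pgmiDateHeader">December 2014</span>
  </div>
</div>'

doc  <- htmlParse(snippet, encoding = "UTF-8")
root <- xmlRoot(doc)

# each post's like/comment summary sits in the aria-label attribute
labels <- xpathSApply(root, "//div[@class='PhotoGridMediaItem']",
                      xmlGetAttr, "aria-label")
labels
# "1200 likes, 34 comments"
```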

3. Download the R script file (ISParser.R). You can download it from my GitHub page.

4. Place the script file in the same directory as the html file.

5. Open R or RStudio, and run the script.
> source("ISParser.R")

Type in the name of the HTML file; in my case, 'cara' from cara.html.

6. After a few seconds, a new file will be created with the prefix "done_". Open the file in Excel and check that everything's okay.
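The "done_" prefix is just string pasting: given the name typed at the prompt, the output file name is built like this (using 'cara' as in the example above):

```r
name_of_file <- "cara"  # the name typed at the prompt, without ".html"

# the script prefixes "done_" and appends ".xlsx"
renamed_file <- paste("done", name_of_file, sep = "_")
xlsx_file <- paste(renamed_file, ".xlsx", sep = "")
xlsx_file
# "done_cara.xlsx"
```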

Voila! It's done and dusted.

Extraction completed! 160 posts and their attributes are all downloaded.

Here's my R script:


library(XML)   # htmlParse(), xmlRoot(), xpathSApply()
library(CSS)   # cssApply(), cssCharacter, cssLink
library(xlsx)  # write.xlsx()

# ask for the file name (without extension) and build the full file name
name_of_file <- readline("type in the name of html file without .html: ")
file_name <- paste(name_of_file, ".html", sep="")

# parse the saved PhotoGrid snippet
i_doc <- htmlParse(file_name, encoding="UTF-8")
i_root <- xmlRoot(i_doc)

# post dates and links, selected by their CSS classes
time <- cssApply(i_doc, ".pgmiDateHeader", cssCharacter)
url <- cssApply(i_doc, ".pgmiImageLink", cssLink)

# prepend a base URL here if the links come out relative (left empty for now)
re_url <- paste("", url, sep="")

# each post's "<n> likes, <m> comments" summary sits in the aria-label attribute
like_and_comment <- xpathSApply(i_root, "//div[@class='PhotoGridMediaItem']", xmlGetAttr, "aria-label")

# read.table splits each string on whitespace:
# column 1 is the like count, column 3 the comment count
like_and_comment_table <- read.table(textConnection(like_and_comment))
likes <- like_and_comment_table[,1]
comments <- like_and_comment_table[,3]

# combine everything and write it out as done_<name>.xlsx
final <- cbind(time, likes, comments, re_url)
renamed_file <- paste("done", name_of_file, sep="_")
xlsx_file <- paste(renamed_file, ".xlsx", sep="")
write.xlsx(final, xlsx_file)
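The least obvious step above is the read.table(textConnection(...)) trick: each aria-label value is a string like "1200 likes, 34 comments", and read.table splits it on whitespace, so column 1 holds the like count and column 3 the comment count. A base-R sketch with made-up numbers:

```r
# made-up aria-label strings of the form Instagram uses in the grid
labels <- c("1200 likes, 34 comments",
            "857 likes, 12 comments")

# read.table treats each string as a row of whitespace-separated fields:
# col 1 = like count, col 2 = "likes,", col 3 = comment count
tab <- read.table(textConnection(labels))
likes    <- tab[, 1]
comments <- tab[, 3]

likes     # 1200 857
comments  # 34 12
```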