Reading through a backlog of articles in English can be mind-boggling at times. Instead of spending hours on each one, I came up with an idea: extract the most important-looking sentences from the academic gobbledygook.
The idea behind the code below is simple. In well-written academic papers, authors tend to use “Thus, ”, “Therefore, ”, or “In sum, ” to summarise their arguments, while “Yet, ” and “However, ” are common phrases for introducing counter-intuitive or contradicting facts and opinions. This code finds the sentences that begin with these keywords. (I decided not to match the lowercase versions for now, as the result gets messier due to the uncontrollable variation of the English language.)
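The keyword lookup can be sketched in a few lines of Python. This is a minimal version of the idea, not the script’s actual code; the keyword tuple, the function name, and the naive sentence splitter are my assumptions:

```python
import re

# Sentence-initial keywords described above; the real script may use more.
KEYWORDS = ("Thus, ", "Therefore, ", "In sum, ", "Yet, ", "However, ")

def extract_important_sentences(text):
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # str.startswith accepts a tuple, so one test covers all keywords.
    return [s for s in sentences if s.startswith(KEYWORDS)]
```

For example, `extract_important_sentences("We ran the study. Thus, X holds. However, Y fails. Done.")` keeps only the second and third sentences.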
After extracting the list of important sentences and removing invalid unicode characters from them, the code accesses my Evernote account, opens the designated notebook, and creates a note with the following information:
note title: “Summary from” + file name
note content: an unordered list of important sentences
note tag: the top 5 most frequently used words, excluding stopwords such as “a”, “the”, “in”, “and”, etc.
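The content and tag steps above can be sketched as follows. This is a sketch under my own assumptions: the stopword set is a stand-in for whatever list the script uses, and the ENML envelope is the standard wrapper Evernote expects around note bodies:

```python
import re
from collections import Counter
from xml.sax.saxutils import escape

# Stand-in stopword list; the real script may exclude more words.
STOPWORDS = {"a", "an", "the", "in", "and", "of", "to", "is", "that", "for"}

def top_words(text, n=5):
    """Return the n most frequent words in text, excluding stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]

def build_note_content(sentences):
    """Wrap the sentences in an unordered list inside the ENML envelope."""
    items = "".join("<li>%s</li>" % escape(s) for s in sentences)
    return ('<?xml version="1.0" encoding="UTF-8"?>'
            '<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">'
            '<en-note><ul>%s</ul></en-note>' % items)
```

Escaping each sentence matters because ENML is XML: a stray `&` or `<` in the extracted text would otherwise make Evernote reject the note.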
The code I wrote below shells out to “pdf2txt.py”. (I couldn’t get my head around parsing PDF files line by line yet, so this may be a lazy, amateurish approach.) The code below won’t work unless you have pdf2txt.py installed on your machine.
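One way to shell out to pdf2txt.py (pdfminer’s command-line tool) from Python is via `subprocess`; the helper names below are hypothetical, not the script’s own:

```python
import shutil
import subprocess

def pdf2txt_command(pdf_path):
    # Hypothetical helper: build the argv for pdfminer's pdf2txt.py CLI.
    return ["pdf2txt.py", pdf_path]

def pdf_to_text(pdf_path):
    # Fail early with a clear message if pdf2txt.py is not on PATH.
    if shutil.which("pdf2txt.py") is None:
        raise RuntimeError("pdf2txt.py not found; install pdfminer first")
    result = subprocess.run(pdf2txt_command(pdf_path),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Checking `shutil.which` first gives a clearer error than letting `subprocess` raise `FileNotFoundError` on a missing executable.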
Here’s a simple example using pdf_extractor.py.
1) Place a PDF file in the same folder as pdf_extractor.py.
The sample PDF I used for this blog post is “Impact_of_the_social_sciences.pdf”, which was available on the internet (I hope this is not a case of copyright infringement).
2) Get your own Evernote developer token and put it in the code.