Japanese Transcript Parser
June 2020 - Current
Project Overview
The goal of this project was to learn web scraping and to automate file extraction and parsing. Although there are several problems that I am still trying to fix, the overall mechanism for word extraction works (with minor manual input).
Technologies
This project was built in Java, using JSOUP for HTML retrieval and parsing and Chromium for browser automation.
How the parser worked
Given an input name of "3-gatsu no Lion", the following process was executed:
- JSOUP was used to fetch the HTML of a website listing Japanese TV shows, each with a link to a transcript zip download page.
- JSOUP was used to parse that HTML and extract the link to the "transcript zip download page" for "3-gatsu no Lion".
- Chromium was used to automatically step through the UI of the "transcript zip download page" and download the transcript zip file. Browser automation was required because the link initially led to a loading screen that only later rendered the page with the download button for the zip file.
- The zip file was unzipped and then deleted.
- The transcript file for the TV show was extracted (this step needed manual input because the folder structure differed between TV shows).
- The transcript file was filtered so that only the words, and no metadata, remained.
- The Java BreakIterator was used to split the transcript text into individual words.
- Each word and its frequency were stored in a hash map.
- The most common words were extracted from the hash map.
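
The unzip-and-delete step above could be sketched with the standard java.util.zip classes. This is a minimal illustration, not the project's actual code: the class name TranscriptUnzipper, the method unzipAndDelete and the charset parameter are all assumptions made for the example. The charset is exposed because archives with Japanese file names are not always UTF-8 encoded, which may or may not apply to the transcript archives used here.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: names are illustrative, not taken from the project.
public class TranscriptUnzipper {

    // Extracts every entry of zipFile into targetDir, then deletes zipFile.
    static void unzipAndDelete(Path zipFile, Path targetDir, Charset nameCharset)
            throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        try (ZipInputStream zin =
                 new ZipInputStream(Files.newInputStream(zipFile), nameCharset)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                Path out = root.resolve(entry.getName()).normalize();
                // Refuse entries such as "../x" that would land outside targetDir.
                if (!out.startsWith(root)) {
                    throw new IOException("Entry escapes target dir: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(out);
                } else {
                    Files.createDirectories(out.getParent());
                    // Copies only the bytes of the current entry.
                    Files.copy(zin, out, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
        // Delete the archive only after extraction finished without error.
        Files.delete(zipFile);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TranscriptUnzipper <archive.zip> <output-directory>
        unzipAndDelete(Paths.get(args[0]), Paths.get(args[1]), StandardCharsets.UTF_8);
    }
}
```

Because the folder structure inside each archive differs, a sketch like this only covers the extraction itself; picking the right transcript file afterwards would still need the manual step described above.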
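
The last three steps (word segmentation, frequency counting and picking the most common words) could look roughly like the sketch below. It is an illustration under assumptions rather than the project's code: the class name WordFrequency, the methods countWords and mostCommon, and the sample sentence are all made up for the example. Segments that do not start with a letter or digit (whitespace, punctuation) are skipped, which is one possible filtering rule and not necessarily the one the project used.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: names are illustrative, not taken from the project.
public class WordFrequency {

    // Splits text at word boundaries and counts every segment that
    // starts with a letter or digit.
    static Map<String, Integer> countWords(String text, Locale locale) {
        Map<String, Integer> counts = new HashMap<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns up to n entries, ordered from highest to lowest count.
    static List<Map.Entry<String, Integer>> mostCommon(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        // Sample text made up for this example.
        String sample = "将棋を指す。将棋は楽しい。";
        Map<String, Integer> counts = countWords(sample, Locale.JAPANESE);
        for (Map.Entry<String, Integer> e : mostCommon(counts, 5)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

How finely BreakIterator segments Japanese text depends on the JDK's built-in rules, so the tokens it produces may not line up with dictionary words; a dedicated morphological analyzer would be an alternative if finer segmentation is needed.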