Jun
25

Change of tools

They say a bad carpenter blames her tools. Well, that’s certainly what I was doing, but luckily, I changed my tools, and I’m making progress!

To try to solve my Arabic Unicode/Python problem, I created a sample .txt file with one article from an independent newspaper. After reading just about every comp sci forum thread on Unicode in Python out there, I still wasn’t able to get Python to print Arabic text. I managed to extract the Unicode code points for a few lines, but when I told Python to encode the text and print it, all I got was a blank line.
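For anyone fighting the same battle, here is a minimal sketch of the kind of thing I was attempting. It assumes Python 3 and a file saved as UTF-8, and both of those are assumptions (I haven't pinned down the encoding of my own sample file). The function names and the filename mentioned afterwards are made up for illustration.

```python
def decode_article(raw_bytes, encoding="utf-8"):
    """Turn raw file bytes into a text string.

    The UTF-8 default is an assumption; a file saved by another tool
    may use a different encoding, such as cp1256.
    """
    return raw_bytes.decode(encoding)


def read_article(path, encoding="utf-8"):
    """Read a whole text file as bytes, then decode it explicitly."""
    with open(path, "rb") as handle:
        return decode_article(handle.read(), encoding)


def describe(text, limit=5):
    """List the first few characters as code points.

    Handy when the terminal cannot render Arabic: the code points show
    whether the text was read correctly even if it displays badly.
    """
    return ["U+%04X" % ord(ch) for ch in text[:limit]]
```

With a hypothetical file called sample_article.txt, `print(read_article("sample_article.txt")[:200])` should show the opening of the article if the terminal can display Arabic, and `describe(...)` gives a fallback view if it can't. If decoding raises an error, the encoding guess is probably wrong.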

Frustrated, I decided to see how R would handle the Arabic text. It worked beautifully, and I didn’t even have to worry about encoding.

Next was the issue of ligatures. While the “meem” and “lam” were still being switched, I realized that the compulsory “lam-alif” ligature was being collapsed to a single letter, the “alif.” This is a huge problem, since fixing it manually would essentially require me to read every single article word for word. I had been using Google Chrome, and even opening the PDFs in Adobe Reader wasn’t giving me the results I needed. I searched online for ways around this and came across OCR. OCR (optical character recognition) is software that converts images or PDFs to text. It’s supposed to work on handwriting, so I figured that since I would be loading in perfectly printed text, it would read it accurately. I thought wrong. It ended up being incredibly inaccurate, so I had to keep looking for other solutions. On a whim, I decided to open the PDF in Firefox. I wasn’t optimistic, especially because the text looked jumbled in the browser. I gave it a shot anyway, copied and pasted, and to my great surprise, it worked! It transferred the ligatures over exactly as they appeared in the original paper.
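Since I can’t proofread every article, a rough automated check for whether the lam-alif survived a copy and paste might help. The sketch below is only that, a sketch: it assumes Python 3, the function names are my own, and it assumes the pasted text stores the ligature either as the two letters lam + alif or as one of Unicode’s single-character lam-alif presentation forms (which NFKC normalization expands back into two letters). It cannot repair text where the lam is already gone; it only flags suspicious text.

```python
import unicodedata

LAM = "\u0644"
# Plain alif plus the hamza-above, hamza-below, and madda variants.
ALIF_FORMS = "\u0627\u0623\u0625\u0622"


def expand_ligatures(text):
    """Expand presentation-form ligatures (e.g. U+FEFB) into base letters."""
    return unicodedata.normalize("NFKC", text)


def count_lam_alif(text):
    """Count lam immediately followed by an alif variant, after expansion."""
    expanded = expand_ligatures(text)
    return sum(
        1
        for first, second in zip(expanded, expanded[1:])
        if first == LAM and second in ALIF_FORMS
    )
```

The lam-alif sequence is very common in Arabic, so a long article with a count of zero would be a strong hint that the ligature was collapsed somewhere between the PDF and the text file.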

Another issue I was able to sort out was Al-Ahram’s PDF archives, which can’t be copied and pasted from. I cross-checked a front page against some of the articles available on LexisNexis and realized that the database was actually missing a front-page news article. I Googled the headline from that article, and it led me to a plain-text version of the article. This is obviously exactly what I need. It’ll be time-consuming and annoying to get the articles that way, but at least there is a way!!

There are still some other issues I’m having with at least one other source, but I’m hoping I’ll be able to find a solution or a way to work around them.