Newspaper archives, Arabic Unicode, and ligatures, oh my!

I’ve been working on research for about two weeks now, and the roadblocks have inevitably begun.

I started the summer by looking for the parts of my literature review that could use some more work and then moved on to collecting data. Unfortunately, the newspapers I thought were available on the LexisNexis database turned out not to go as far back as I needed. I’m collecting articles starting from early January 2011, but Al-Ahram is only available from October 2011 and Al-Akhbar from 2012. I then turned to the newspapers’ own websites, but found that although Al-Ahram’s archives are extensive, they are in a strange PDF/JPEG format that doesn’t let me copy and paste the text of the articles. Al-Akhbar (state-run) and Al-Shorouq (independent) have PDF archives that do let me copy and paste. However, Arabic uses many ligatures and special characters, and these complicate things when I paste the text into another document. For example, when I first tried copying and pasting, most of the text transferred properly, but several characters came through as question marks inside a black diamond:

[Screenshot: pasted Arabic text with several characters shown as question marks inside black diamonds]


At first, I thought some of the actual Arabic letters had been lost, but I soon realized that these were just extraneous characters, so I was able to do a quick find and replace to get rid of them. However, I also noticed another formatting issue, this time due to Arabic’s use of ligatures. A ligature is essentially two or more letters represented as a single glyph. Arabic letters are often “stacked,” giving a more calligraphic aesthetic. For the most part, newspapers don’t use many ligatures, since only one (lam-alif) is compulsory. The newspapers I’m working with do, however, use a ligature in the “alif-lam-meem” sequence.
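Going back to the black-diamond characters for a moment: that symbol is how most fonts draw U+FFFD, the Unicode replacement character, so the find and replace can also be scripted. Below is a minimal Python sketch, assuming the stray characters really are U+FFFD (I have not confirmed that for every file). The function name `strip_replacement_chars` and the sample string are made up for illustration.

```python
# U+FFFD is the Unicode replacement character; many fonts draw it as a
# question mark inside a black diamond. Assumption: this is the stray
# character that shows up in the pasted text.
REPLACEMENT_CHAR = "\ufffd"


def strip_replacement_chars(text):
    """Return the text without U+FFFD, plus how many were removed."""
    removed = text.count(REPLACEMENT_CHAR)
    return text.replace(REPLACEMENT_CHAR, ""), removed


# Hypothetical pasted fragment containing two stray replacement characters.
sample = "\u0627\u0644\ufffd\u0645\u0635\u0631\u064a\ufffd"
cleaned, count = strip_replacement_chars(sample)
print(count, cleaned)
```

Returning the count alongside the cleaned text makes it easy to check that the number of removed characters matches what is visible in the pasted document.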

[Screenshot] The highlighted section shows the “alif,” “lam,” and “meem,” with the “lam” and “meem” joined in a ligature. This is exactly how the word appears in the newspaper.

[Screenshot] The highlighted section here shows how the word is rendered once I copy and paste it into a spreadsheet or Word document. Instead of “alif,” “lam,” and “meem,” the last two letters have been switched, so the word actually starts with “alif,” “meem,” and “lam.”

[Screenshot] Finally, this is how the word should copy over. I manually switched the two letters back. So this and the first image show the same word, represented slightly differently.
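One way to see what actually ended up in the pasted text is to list the Unicode code point and name of each character. This is only a sketch: the two strings below are hypothetical stand-ins for the pasted and corrected forms, not text taken from the newspapers, and `describe` is a helper name I made up. It uses Python's standard `unicodedata` module.

```python
import unicodedata


def describe(word):
    """Return a (code point, Unicode name) pair for each character in word."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in word
    ]


# Hypothetical examples: the same three letters in the two orders.
expected = "\u0627\u0644\u0645"  # alif, lam, meem
pasted = "\u0627\u0645\u0644"    # alif, meem, lam (the order after pasting)

for label, word in [("expected", expected), ("pasted", pasted)]:
    print(label)
    for code, name in describe(word):
        print("  ", code, name)
```

If a PDF happened to store a ligature as a single presentation-form character rather than as separate letters, it would show up here with a name beginning "ARABIC LIGATURE", which would be a useful clue about where the reordering comes from.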


Now I’m trying to learn more about Arabic Unicode. I’m also trying to figure out how to use Python with Arabic data files. If I can’t find a simple fix for the copy and paste issue, I’m hoping to run code that finds the words that start with “alif-meem-lam,” so I can manually go back and fix whichever ones need fixing. I wouldn’t want the code to switch the letters itself, since that would risk changing words that really should start with those three letters. So far, that’s the only ligature I’ve found in the newspapers. Once I get to hand-coding, I’ll keep an eye out for others.
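Here is a rough sketch of what that flagging step might look like in Python. It only reports candidates and never rewrites them, in line with the concern above. Several things are assumptions: the text is already a Python string, words are separated by whitespace, and a suspect word begins directly with the three letters (so words with an attached prefix or a leading quotation mark would be missed). `flag_suspect_words` and the sample text are hypothetical.

```python
import re

ALIF, LAM, MEEM = "\u0627", "\u0644", "\u0645"

# The letter order that shows up after pasting: alif, meem, lam.
SUSPECT_PREFIX = ALIF + MEEM + LAM


def flag_suspect_words(text):
    """Return (character offset, word) for each word starting with alif-meem-lam.

    Nothing is changed; the list is meant for manual review.
    """
    return [
        (match.start(), match.group())
        for match in re.finditer(r"\S+", text)
        if match.group().startswith(SUSPECT_PREFIX)
    ]


# Hypothetical sample: the first word has meem and lam swapped,
# the second word starts with alif-lam and should not be flagged.
text = "\u0627\u0645\u0644\u0635\u0631\u064a \u0627\u0644\u064a\u0648\u0645"
for offset, word in flag_suspect_words(text):
    print(offset, word)
```

Keeping the character offset with each word makes it easier to jump to the right spot in the document when checking the candidates by hand.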

The last issue I’m having is with my second source of independent newspaper articles, Al Masry Al Youm. The archives are as extensive as I need them to be, but they’re not presented in newspaper format, so there’s no way for me to know what was front-page news. This isn’t a huge concern, since I can work around it with an alternate coding scheme.

I’ll update as I make progress! And on the off chance anyone reading this knows anything about Arabic Unicode, I would be super grateful for any and all help!



  1. Sounds frustrating. Figuring out how to research in archives mostly in Spanish seemed difficult, but I cannot imagine figuring out Arabic Unicode.