Saturday, November 10, 2012

Python Please

I have friends who swear by Python.

They rarely program in any other language, and in theory I understand the appeal. You have a language that forces you to write nicely formatted code just to make it work. You do away with redundant, structure-imposing syntax like braces and semicolons. No longer do you have to waste time formatting someone else's crappy code before you can work with it.

What I can't understand is how they manage to live with its horrendous approach to string processing. Python is a late-bound, dynamically typed, interpreted language with an approach to string processing that belongs in C or assembly language. At that, all the purists are going to cry:

"Just because you don't understand encodings!!!"

Sigh. Yes. Every time I have to work with Python I go and reread Joel's fantastic article, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", just to make sure I haven't missed something. Every time, I confirm that I am not a complete idiot, and I come back to wrestle with Python and try to work out where in the process of passing a string around I went wrong.

The problem is partly that I use Python predominantly to scrape webpages. This means that I am always loading up badly formatted text with incomplete or missing metadata. So, to be fair, maybe programmers who do not do this kind of work never see the problems I see. But it is not only me: look at this thread on Stack Overflow to see how ridiculous the situation is.
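To make the scraping problem concrete, here is a minimal sketch of the kind of defensive decoding it seems to demand. It assumes Python 3, where raw bytes and text are separate types. The helper name to_text and its declared parameter are made up for illustration; they do not come from urllib or Beautiful Soup. The idea is to decode once, at the point where the bytes arrive: trust the charset the page claims, fall back to UTF-8, and finally fall back to Latin-1, which can decode any byte sequence (possibly into the wrong characters).

```python
def to_text(raw, declared=None):
    """Decode scraped bytes to str once, at the boundary.

    Hypothetical helper: `declared` is whatever charset label the page
    or HTTP header supplied, which may be missing or simply wrong.
    """
    # Try the declared charset first, then UTF-8.
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Bad bytes for this codec, or an unknown codec name:
            # move on to the next guess.
            continue
    # Last resort: Latin-1 maps every byte to a character, so this
    # never raises, but the result may be mojibake.
    return raw.decode("latin-1")


page = "café".encode("latin-1")   # bytes with no usable metadata
text = to_text(page)              # falls through to Latin-1
```

Everything after that call works with str only. This does not make the guessing go away, it just confines it to one place.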

I want to propose something to Python enthusiasts. Let's say you are right, and the problems are entirely mine (real Python programmers love having to monitor string encodings continuously). OK, sure. Then:

Why not have a mode for the language that will just force all strings to be a single encoding, say UTF-8?

The chorus will yell back: "We do. You just do X-Y-Z."

Well, people, I have tried all of those X-Y-Zs and they do not work. Perhaps, again, it has something to do with my approach. I use a bunch of libraries to process the data, maybe urllib, or Beautiful Soup, which I use to parse things. I don't know. I am not an expert, and I shouldn't need to be just to parse strings reliably.

I don't understand why it just doesn't work. I have never wasted as much time on string encoding problems in any other language as I have in Python.

It should not be so hard. It really shouldn't.