29 items tagged “unicode”
2021
Re-assessing the automatic charset decoding policy in HTTPX (via) Tom Christie ran an analysis of the top 1,000 most accessed websites (according to an older extract from Google’s Ad Planner service) and found that a full 5% of them both omitted a charset parameter and failed to decode as UTF-8. As a result, HTTPX will be depending on the charset-normalizer Python library to handle those cases.
2019
String length—Rosetta Code
(via)
Calculating the length of a string is surprisingly difficult once Unicode is involved. Here's a fascinating illustration of how that problem can be attached dozens of different programming languages. From that page: the string "J̲o̲s̲é̲"
("J\x{332}o\x{332}s\x{332}e\x{301}\x{332}"
) has 4 user-visible graphemes, 9 characters (code points), and 14 bytes when encoded in UTF-8.
2018
Big tech warns of ’Japan’s millennium bug’ ahead of Akihito’s abdication (via) Emperor Akihito’s abdication in April 2019 triggers a new era, and the Japanese calendar counts years from the coronation of the current emperor. The era hasn’t changed since 1989 and a great deal of software is unable to handle a change. To make things more complicated... the name of the new era will be announced in late February, but it needs to be represented in unicode as a single new character... and the next version of Unicode (v12) is due out in early March. There may have to be a Unicode 12.1 released shortly afterwards that includes the new codepoint.
ftfy—fix unicode that’s broken in various ways (via) I shipped a small web UI wrapper around the excellent Python FTFY library, which can take broken unicode strings and suggest a sequence of operations that can be applied to get back sensible text.
2017
I'm concerned that this character will open the floodgates for an open-ended set of PILE OF POO emoji with emotions, such as CRYING PILE OF POO, PILE OF POO WITH LOOK OF TRIUMPH, PILE OF POO SCREAMING IN FEAR, etc. Is there really any need to add a range of emotions to PILE OF POO? I personally think that changing PILE OF POO to a de facto SMILING PILE OF POO was wrong, but adding F|FROWNING PILE OF POO as a counterpart is even worse. If this is accepted then there will be no neutral, expressionless PILE OF POO, so at least a PILE OF POO WITH NO FACE would be required to be encoded to restore some balance.
The idea that our 5 committees would sanction further cute graphic characters based on this should embarrass absolutely everyone who votes yes on such an excrescence. Will we have a CRYING PILE OF POO next? PILE OF POO WITH TONGUE STICKING OUT? PILE OF POO WITH QUESTION MARKS FOR EYES? PILE OF POO WITH KARAOKE MIC? Will we have to encode a neutral FACELESS PILE OF POO?
2012
What is an intuitive explanation of Unicode and why a programmer needs to know it?
Check out “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” by Joel Spolsky: http://www.joelonsoftware.com/ar...
[... 55 words]2010
Reexamining Python 3 Text I/O. Python 3.1’s IO performance is a huge improvement over 3.0, but still considerably slower than 2.6. It turns out it’s all to do with Python 3’s unicode support: When you read a file in to a string, you’re asking Python to decode the bytes in to UTF-8 (the new default encoding) at the same time. If you open the file in binary mode Python 3 will read raw bytes in to a bytestring instead, avoiding the conversion overhead and performing only 4% slower than the equivalent code in Python 2.6.4.
2009
Unicode code converter (via) Fantastically useful tool to convert strings of characters in to every unicode and/or escaping syntax you can possibly imagine.
Understanding Bidirectional (BIDI) Text in Unicode. It turns out you need to sanitise user input to ensure there are no unicode characters that switch your site’s regular text to RTL.
2008
UnicodeDictWriter—write unicode strings out to Excel compatible CSV files using Python. Stuart Langridge and I spent quite a while this morning battling with Excel. The magic combination for storing unicode text in a CSV file such that Excel correctly reads it is UTF-16, a byte order mark and tab delimiters rather than commas.
Django 1.0 alpha release notes. The big features are newforms-admin, unicode everywhere, the queryset-refactor ORM improvements and auto-escaping in templates.
PortingDjangoTo3k. Martin von Loewis has started assembling a patch. His write-up illustrates some key differences between Python 2.X and Python 3—it looks like Django’s unicode handling is going to require the most work.
2007
Sam Ruby: Ruby 1.9 Strings—Updated. A follow up to yesterday’s post: Sam’s principle complaints about Ruby 1.9’s character encoding support were down to a bug which has now been fixed.
I definitely like Python 3K's Unicode support better [...] In fact, I think I prefer Ruby 1.8's non-support for Unicode over Ruby 1.9's "support". The problem is one that is all to familiar to Python programmers. You can have a fully unit tested library and have somebody pass you a bad string, and you will fall over.
— Sam Ruby
Ruby 1.9—Right for You? Dave Thomas on the just-released Ruby 1.9. It’s a development release that breaks backwards compatibility in a few minor ways, but new features include the YARV virtual machine (hence significant speed improvements) and unicode support via associating encodings with bytestrings.
Unicode code converter (via) Richard Ishida’s tool for converting pretty much any unicode representation to any other.
String types in Python 3. bytes are now immutable (just like the bytestrings they are replacing) and a new mutable buffer type has been introduced.
The larger question is why on earth, in 2007 and ten years after XML came out, we are still using text files that don't label their encoding?
Sam Ruby: 2to3. Sam’s report on an attempt to port the Universal Feed Parser to Python 3.0. The 2to3 tool does most of the work, but it seems the unicode changes can be pretty tricky.
Announcing Babel. Impressive new Python i18n / l10n package, with improved message extraction and a huge amount of bundled locale data.
UnicodeBranch: Porting Applications. A checklist for porting Django applications to handle the new unicode changes. If your application only handles ASCII text at the moment you shouldn’t have to change a thing.
Unicode data in Django. Documentation for Django’s new unicode support.
Django changeset 5609. “Merged Unicode branch into trunk. This should be fully backwards compatible for all practical purposes.”
HTML Entity Character Lookup. Look up HTML entities by characters that are a similar shape.
Django unicode-branch: testers wanted. Malcolm’s outstanding work on the unicode branch appears to be nearing completion.
Translations of My hovercraft is full of eels in many languages (via) Great for unicode testing.
2006
Javascript character set screw-ups (via) Some browsers treat JavaScript files as having the same content-type as the page from which they are linked. This could cause problems with UTF-8 encoded JSON; the workaround is serving up ASCII with unicode escape sequences.
Unicode character information. Useful lookup tool.