2011-12-11

Automated testing with Python3 function annotations

python3 testing

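The other day I refactored a function that takes a parameter annotated with @Nullable, split it into smaller functions, and committed the change. FindBugs™ - Find Bugs in Java Programs, which runs on our build server, then complained that the parameter was marked @Nullable but never null-checked (; ;)

I'm not used to Java code, so I refactored by following Eclipse's suggestions and missed what Eclipse could not check. The fix is easy, of course, but it was a little embarrassing.

A quick search shows that there is also an Eclipse plugin *1. Let's install the FindBugs plugin into Eclipse.

...

( ﾟдﾟ)ﾊｯ! Wrong topic!

Today I'm writing my entry for the 2011 Python Advent Calendar (Python3)!

Python now has function annotations too

According to PEP 3107 -- Function Annotations, function annotations can be written starting with Python3.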

def foo(a: 'x', b: 5 + 6, c: list) -> max(2, 9):
    ...

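Judging from this sample, annotations are apparently meant to accept arbitrary expressions, but I expect the most common use to be plain types such as int and str. The annotations are stored as a dictionary in func.__annotations__.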

{'a': 'x',
 'b': 11,
 'c': list,
 'return': 9}
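The dictionary above can be reproduced with a few lines. This is a minimal sketch using the sample signature from PEP 3107; it only shows that `__annotations__` holds the values of the annotation expressions (so `5 + 6` appears as `11`), not their source text.

```python
def foo(a: 'x', b: 5 + 6, c: list) -> max(2, 9):
    ...

# __annotations__ maps parameter names (plus 'return') to the values
# of the annotation expressions.
annotations = foo.__annotations__
print(annotations)
```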

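According to Python2orPython3 - Python Wiki, function annotations will not be backported to Python 2.x. Because they are only available in Python3, I had never actually seen code that uses them.

What are function annotations good for?

If that is your question, first read 第7回関数アノテーションでスマートにプラスアルファの実現：Python 3.0 Hacks｜gihyo.jp … 技術評論社 by @methane, who also wrote day 3 of this Advent Calendar *2.

That article explains that function annotations let you express the following concisely.

They serve as documentation in themselves
They can drive automatic type conversion
They can be used to define overloads

For now, however, function annotations are simply stored as information. What to do with the signature is up to the programmer, and we are at the stage of waiting for third-party libraries.

Perhaps because Python3 is not yet widespread, anntools, which performs type checking and validation based on function annotations, does not appear to be actively developed. With anntools, Python 2.x code can also attach function annotations through decorators. That said, anyone who needs this kind of extension has probably implemented it already, so there is little incentive to go back and modify existing code that lacks it.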
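To make the idea of annotation-based type checking concrete, here is a minimal sketch of such a decorator. It is not how anntools works; `check_types` is a hypothetical name, and the sketch assumes that only annotations that are plain classes should be checked.

```python
import functools
import inspect

def check_types(func):
    """Hypothetical decorator: validate arguments against class annotations."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Map positional and keyword arguments onto parameter names.
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        for name, value in bound.arguments.items():
            expected = func.__annotations__.get(name)
            # Only plain classes are checked; other annotation values are ignored.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError("{} must be {}, got {}".format(
                    name, expected.__name__, type(value).__name__))
        return func(*args, **kwargs)
    return wrapper

@check_types
def bar(a: str, b: int, k: str = "keyword") -> str:
    return "'{}' + '{}' + '{}'".format(a, b, k)

print(bar("x", 1))
```

Calling `bar("x", "y")` raises TypeError in this sketch because `b` is annotated as int.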

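What can we do with the signature?

The most obvious use is testing. So let's try automated random testing.

paycheck, a Python implementation of QuickCheck: An Automatic Testing Tool for Haskell, supports Python3. With paycheck, data-driven tests are easy to implement. In this post I use paycheck and nose to run random data-driven tests.

But first I need a development environment...

As it happens, virtualenv supports Python3 too. Create a virtual environment and install paycheck and nose.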

$ /opt/local/Library/Frameworks/Python.framework/Versions/3.2/bin/virtualenv --distribute ~/.virtualenvs3/advent
$ ~/.virtualenvs3/advent/bin/easy_install paycheck nose
$ source ~/.virtualenvs3/advent/bin/activate
(advent)$ which python
/Users/t2y/.virtualenvs3/advent/bin/python
(advent)$ python
Python 3.2.2 (default, Nov  5 2011, 19:51:07) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import paycheck

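I also want to use IPython.

$ sudo port install py32-ipython # I install ipython through MacPorts

Configure IPython with a library path that takes the virtualenv environment into account. This code was copied from somewhere. Note that print is a function, not a statement.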

(advent)$ vi ~/.ipython/profile_python3/ipython_config.py
import site
from os import environ
from os.path import join
from sys import version_info

if 'VIRTUAL_ENV' in environ:
    virtual_env = join(environ.get('VIRTUAL_ENV'),
                       'lib',
                       'python%d.%d' % version_info[:2],
                       'site-packages')
    site.addsitedir(virtual_env)
    print('VIRTUAL_ENV ->', virtual_env)
    del virtual_env
del site, environ, join, version_info

(advent)$ ipython3-3.2 
...
VIRTUAL_ENV -> /Users/t2y/.virtualenvs3/advent/lib/python3.2/site-packages
In [1]: import paycheck

There, the environment is ready. I did not have a proper Python3 environment until now (> <)

Actually writing function annotations

Let's try writing some. With types only, they do not look all that strange.

(advent)$ vi others.py
__all__ = ["foo", "bar", "baz"]
 
def foo(a: str, b: int, c: {str: int}, d: float) -> tuple:
    return a, b, c, d

def bar(a: str, b: int, k: str="keyword") -> str:
    return "'{}' + '{}' + '{}'".format(a, str(b), k)

def baz(a: str, b: int, *args: tuple, **kwargs: dict) -> list:
    return [a, b, args, kwargs]

Let's also peek at the contents of __annotations__.

In [2]: foo.__annotations__
Out[2]: {'a': str, 'b': int, 'c': {str: int}, 'd': float, 'return': tuple}

In [3]: bar.__annotations__
Out[3]: {'a': str, 'b': int, 'k': str, 'return': str}

In [4]: baz.__annotations__
Out[4]: {'a': str, 'args': tuple, 'b': int, 'kwargs': dict, 'return': list}

Trying an ordinary data-driven test

First, let's learn how to use paycheck.

(advent)$ vi tests/test_with_paycheck_sample.py 
# -*- coding: utf-8 -*-

from paycheck import with_checker

@with_checker(str, str, number_of_calls=3, verbose=True)
def test_func(a, b):
    assert(isinstance(a + b, str))

With code like this, paycheck passes random str values to the arguments of test_func. When the verbose option is True, the randomly generated inputs are printed to standard error.

(advent)$ nosetests tests/test_with_paycheck_sample.py 
0: ('64+p57P8:G]NI.B5K', 'b#-O9SS#0#Ohq')
1: ('\\l<?[f$:}ld|1|Y<rd;XEi/^{)`', 'F*#(W,v6h2')
2: ('-9PBxyd(0y6j~/', 'CJMZPEIRn^>~#2')
.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK

Going further, irange and frange restrict values of the corresponding type to a range, and choiceof picks values from an arbitrary list.

from paycheck import choiceof, irange, with_checker

@with_checker(irange(1, 10), number_of_calls=3, verbose=True)
def test_func2(i):
    assert(i <= 10)
# Output
0: (9,)
1: (2,)
2: (3,)

@with_checker(choiceof([3, 5]), number_of_calls=3, verbose=True)
def test_func3(i):
    assert(i == 3 or i == 5)
# Output
0: (3,)
1: (5,)
2: (5,)

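There also appear to be generators such as positive_float and non_negative_float. paycheck looks handy for data-driven tests that compare results against expected values as well.

Automating random data-driven tests

Wouldn't it be convenient if the test could also find modules automatically and test the functions they provide? As long as the signature is known, it can! This is finally where function annotations come in.

I wrote the following as a sample implementation. It looks for "*.py" files in the parent directory of the test directory and runs tests based on the signatures of the functions listed in each module's __all__.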

# -*- coding: utf-8 -*-

import glob
import imp
import inspect
import sys
from os.path import (abspath, dirname)

from nose.tools import *
from paycheck import with_checker

CHECKER_PARAMETER = {
    "number_of_calls": 3,
    "verbose": True,
}

def debug(msg: str) -> None:
    sys.stderr.write("{}\n".format(msg))

def get_modules(target_dir: str):
    # Import every *.py file found directly under target_dir.
    for pyfile in glob.glob("{}/*.py".format(target_dir)):
        mod_name = pyfile.split("/")[-1].replace(".py", "")
        mod = imp.load_module(mod_name, *imp.find_module(mod_name))
        yield mod

def get_functions_with_ann(modules):
    # Only names exported through __all__ are considered.
    funcs = (getattr(mod, name) for mod in modules for name in mod.__all__)
    for func in funcs:
        if hasattr(func, '__annotations__'):
            yield func

def test_random_with_paycheck() -> None:
    def tester(*args, **kwargs):
        # func and ret_type are taken from the enclosing loop below.
        result = func(*args, **kwargs)
        ok_(isinstance(result, ret_type))

    base_dir = dirname(dirname(abspath(__file__)))
    for func in get_functions_with_ann(get_modules(base_dir)):
        debug("target function: {}".format(func.__name__))
        spec = inspect.getfullargspec(func)
        args = spec.args
        if spec.varargs:
            args.append(spec.varargs)
        if spec.varkw:
            args.append(spec.varkw)
        ret_type = spec.annotations.get("return")
        # Keep the types in the same order as the function's parameters.
        types = [spec.annotations[arg] for arg in args]
        with_checker(*types, **CHECKER_PARAMETER)(tester)()

The directory layout is shown below. Let's run it.

(advent)$ tree .
.
├── others.py
└── tests
    ├── test_with_annotation.py

(advent)$ nosetests tests/test_with_annotation.py 
target function: foo
0: ("O3FND..fOSWv{KWeW:gl8'%k|L", 7607741906685156877, {'': 8791364593896247432, 'A': 7981434242837100514, '>KbMIsq#0kV;U?yxj2s~g,[%LQyrE': -190598769762457072, 'S7J:Um?<{ZtN:L@': -7691133294110638585, '0eV71S07lh~e>rb5P_6zE;5': 1101451838899520496, '*qU4~J*': 6338273523869299236, '|wMLD^\\ysKOw\\c6&S!Be3|hcz': 5053081943822034822, '{C<': 1734444387651285061, '$As^l,_C/av)}1R&HNz7sYd\\1d;.ex': -885374290895090654, '(qs$Ej]f': -8267062632669025484, ")'lOY533cm;jjHP5oI{LVCmRR[": -8668668576751442202, '=rACn7|@C': 478968652357174282, '5SNk0l\\4': -867212168323926037, 'fbB3#+xwU|': 8473818803708212295, 'd2.xgfT.V*<(y': -6515904853273909217, "KGDeofip:[_~M+K~>!'": -3589212816856071640, 'ZgM~': -602505023626250450, "|IJGj~';YFE-1wPPrEs%\\'-h": 4094644477640025745, "r!%n%'qohCttnXe8=7SDi^|t3": 427941587074733809, 'h%': -1809851284353770487}, -0.00023076482227383914)
1: ('/Qhp"NzOc.[|5CiJ', 5190099172656242926, {'': 6382145368304854615, 'x.0?lg@l': -4519850178140629357, 'u?B\\D2': -6081180918953419200, 'w+8inf3XnQ)wF+R8Mx;': -5279979493522305960, '=x0Y"{v': -1051360238739264279, 'LXZv<vV': 8490996434245906021, 'Sa$H*ed^,`$-EZ_%': -6937052124172693463, 'Q);n5': 60653761990170108, "\\`F{`aQ5w'": 1358220429869542064, 'j,,EVP=2WXua8)<oW-W[UngZ8p': 6151527201046578895, "HjY4H:oC'38?.aCO": -5710875614350879758, '0': -3166246628482595309, '#PIc2.': -615037772330393927, 'k%/': -8539311459395790283, 'tx<1': -7016431055285318858, 'Y$"L}EDq&A@msm': -7487772718733717165, 'Epz<eD=qzxRP': -5309516819741565453, 'B>Z95&ON:G>\\rgakkK/XQ^J#': 1080556375731418693, '!x': -8305477197940126401, 'b"m|\\`.$LQ)x`w+q%L6s_a,9\'': -5627647156759687669, 'c': -8050980599323942487, 'K4m\\^HW\\Ki>x_Tr': 1451298324637113436, '9;5uPcy43@7qr[': 7557790634460355432, 'jV': -6775386229302154514, '5Mu[,g': 7789805996343655479, 'ln1MH2qtO-(#8@l_W]P': 7934835116394274442, 'Di64M>{;(t\\/YJ4=Q*"X^>qowh': 3744629399181575512, '7].i': -1231696801069995861}, 0.021354448475725422)
2: ('@KGvLsf{CXEkwudbb$&a>t?`q&-tL', 2813673244267029793, {'m4#3<\\^8=tK': 2445679757000420077}, -0.03955141006906784)

target function: bar
0: ('X9|wG.n+xJ1Uzj?`q]+\\6>C"8_', 7102757083111770696, '%Qd|@')
1: ('fw"F', -508039826724708831, 'v0W6a}u[""@#?o;ziXOd-eFv=+"')
2: ('AUI6|BTLp%1K$u', -3393106434267748224, 'O.')

target function: baz
.
----------------------------------------------------------------------
Ran 1 test in 0.005s

OK

It found others.py in the current directory and ran the tests, as intended.

Hmm!?

The tests for foo and bar ran, but the test for baz does not seem to have run.

def foo(a: str, b: int, c: {str: int}, d: float) -> tuple:
  ...

def baz(a: str, b: int, *args: tuple, **kwargs: dict) -> list:
  ...

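I have not looked into it in detail, but it seems that paycheck does not generate input data for tuples and dictionaries unless they are written with their element types, as in (int, int) or {str: str}. A bare tuple or dict, as used for baz, carries no information about what the container should hold.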
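The following is a minimal sketch, using only the standard library, of why a spec such as `{str: int}` can drive data generation while a bare `tuple` cannot. It is not paycheck's implementation; `generate` and `ALPHABET` are hypothetical names, and the value ranges are arbitrary assumptions.

```python
import random
import string

ALPHABET = string.ascii_letters + string.digits + string.punctuation

def generate(spec, rng):
    """Hypothetical helper: build one random value that matches spec."""
    if spec is int:
        return rng.randint(-2**63, 2**63 - 1)
    if spec is float:
        return rng.uniform(-1.0, 1.0)
    if spec is str:
        length = rng.randint(0, 30)
        return "".join(rng.choice(ALPHABET) for _ in range(length))
    if isinstance(spec, tuple):
        # (int, int) describes each element of the tuple.
        return tuple(generate(item, rng) for item in spec)
    if isinstance(spec, dict) and len(spec) == 1:
        # {str: int} describes the key type and the value type.
        (key_spec, value_spec), = spec.items()
        size = rng.randint(0, 5)
        return {generate(key_spec, rng): generate(value_spec, rng)
                for _ in range(size)}
    # A bare class such as tuple or dict says nothing about its elements.
    raise TypeError("no element types given for {!r}".format(spec))

rng = random.Random(0)  # fixed seed so that runs are repeatable
sample = generate({str: int}, rng)
pair = generate((int, int), rng)
```

With this sketch, `generate(tuple, rng)` raises TypeError, which mirrors the situation of baz above.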

Next, let's look at the test function.

    def tester(*args, **kwargs):
        result = func(*args, **kwargs)
        ok_(isinstance(result, ret_type))

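For a variety of inputs, this test verifies the following.

The function runs without raising an error
The return value has the expected type

In other words, it tests that unexpected input data does not cause errors.

Note also that the types passed to with_checker must be given in the same order as the parameters of the function under test.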

...
        spec = inspect.getfullargspec(func)
        args = spec.args
...
        types = [spec.annotations[arg] for arg in args]
        with_checker(*types, **CHECKER_PARAMETER)(tester)()

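inspect.getfullargspec returns the full argument specification of a function, including its annotations. It gives you the list of parameter names in declaration order and tells you whether variable-length arguments (*args and **kwargs) are present.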

In [12]: inspect.getfullargspec(baz)
Out[12]: FullArgSpec(args=['a', 'b'], varargs='args', varkw='kwargs', defaults=None, kwonlyargs=[], kwonlydefaults=None, 
         annotations={'a': <class 'str'>, 'b': <class 'int'>, 'args': <class 'tuple'>, 'return': <class 'list'>, 'kwargs': <class 'dict'>})
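As a small illustration of the ordering point, the sketch below pairs each parameter name with its annotation in declaration order. `ordered_annotations` is a hypothetical helper, not part of inspect or paycheck, and it copies `spec.args` so that the FullArgSpec itself is left unchanged.

```python
import inspect

def baz(a: str, b: int, *args: tuple, **kwargs: dict) -> list:
    return [a, b, args, kwargs]

def ordered_annotations(func):
    """Hypothetical helper: (name, annotation) pairs in parameter order."""
    spec = inspect.getfullargspec(func)
    names = list(spec.args)  # copy so that spec.args is not modified
    if spec.varargs:
        names.append(spec.varargs)
    names.extend(spec.kwonlyargs)
    if spec.varkw:
        names.append(spec.varkw)
    return [(name, spec.annotations.get(name)) for name in names]

pairs = ordered_annotations(baz)
print(pairs)
```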

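Summary

Function annotations are useful as documentation, and libraries that apply static analysis techniques to them will probably appear in the future. One thing I noticed along the way is that I could not find out how to annotate a generator. Maybe that has not been decided yet.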

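Also, this was my first time using paycheck, and I could see that it pairs well with function annotations. The one disappointment is that when an error occurs inside with_checker, an exception is raised and the test stops there. For data-driven testing, it is more convenient to see the results for all the other inputs together even when one input fails. After using it for a bit, I felt it would be good to improve that point. I would also like some documentation.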
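The behaviour wished for here could look roughly like the following sketch: every input is tried, and failures are collected instead of stopping at the first one. This is not a paycheck feature; `run_all` and `check_positive` are hypothetical names used only for illustration.

```python
def run_all(test_func, inputs):
    """Hypothetical runner: try every input and collect all failures."""
    failures = []
    for args in inputs:
        try:
            test_func(*args)
        except Exception as exc:
            # Remember the failing input and keep going.
            failures.append((args, exc))
    return failures

def check_positive(n):
    if n <= 0:
        raise ValueError("expected a positive number, got {}".format(n))

failures = run_all(check_positive, [(1,), (-2,), (3,), (0,)])
for args, exc in failures:
    print("FAILED for {!r}: {!r}".format(args, exc))
```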

The next Advent Calendar entry is by @torufurukawa. He has been giving talks on Python3 for some time, so I am looking forward to it.

*1:第7章 FindBugs™ Eclipse プラグインの使用方法

*2:Python3 Advent Calender 3日目 - New GIL を理解する - methaneのブログ

2011-04-02

From bdist_rpm to Py2RPM

translate python3

bdist_rpm is dead, long life to Py2RPM

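This post is based on my Japanese translation of the article linked above.
Transcription errors and mistranslations will be corrected as needed.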

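One thing I have learned while working on Python packaging is that a tool within the scope of the standard library such as bdist_rpm, which builds RPMs for projects that use Distutils, is next to impossible to maintain.

When I asked around, 9 out of 10 packagers did not use bdist_rpm for their day-to-day RPM packaging. Then again, they are not the developers of the projects they package. They use their own tools.

There are several reasons.

bdist_rpm makes a lot of assumptions when it creates the spec file.
There is no way to customize sections of the generated spec file.
Packagers spend time editing a spec file before it is actually used. Fully automatic conversion from Python metadata to RPM metadata is therefore not a very good idea. And since packagers have their own templates, they end up writing the spec file by hand.
What should be done when targeting RedHat 5 rather than the latest Fedora?
The release cycle of the Python standard library does not match the cycle of the distributions. Even if we update the standard library, it will quickly become outdated and deprecated.

That is why, when adding a bdist_deb command to Distutils was proposed a few months ago, I pointed out that building deb files should be done in a separate project. It now looks as if stdeb is filling that role.

Windows is a little different. A tool like bdist_msi is easy to maintain, because the win32 world has few packaging "flavors", in fact only one. The release cycle of the Python standard library certainly works well here.

Please support the principle of removing bdist_rpm from Distutils.

RPM support should likewise become a separate project that provides proper RPMs for Python users, and it will replace bdist_rpm on top of Distutils2 (which becomes the packaging module in Python 3.3).

As a matter of fact, I started this project a while ago for my work at Mozilla, and at the time I called it pypi2rpm, because its main feature is building CentOS RPM packages from projects released on PyPI.

Let's try it.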

$ pip install pypi2rpm
$ pypi2rpm.py SQLAlchemy
...log output...
/tmp/ok/python26-sqlalchemy-0.6.6-1.noarch.rpm written

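The script uses Distutils2 to browse PyPI and sort versions, and then uses a custom version of bdist_rpm that runs rpmbuild. It can also build an RPM from a spec file that already exists in the project, which bypasses setup.py.

However, this is a simple script that I wrote for my own purposes. In the long run it will be renamed Py2RPM and will replace bdist_rpm. These are the features I would like it to provide.

A standalone script that can build an RPM with settings changed from the command line
A standalone script that can generate a spec file from a setup.cfg file
Reads all options from setup.cfg and leaves out duplicated metadata fields such as the name and version
Is aware of the differences between RPM-based distributions
Makes RPC calls to various repositories such as CentOS and Fedora to check whether a name is already taken, whether files conflict, and so on
Is a distutils2 command, so the script can be called from pysetup at the top level when needed

As a side note (this is a topic for another blog post), setup.cfg becomes the equivalent of a spec file, and that file is kept under version control. Put the name of {your packaging system here}: this makes it possible to provide similar features with the tools of the RPM world. And because the file can be used without downloading the tarball of the release, it becomes available on PyPI right away.

If you are interested in Py2RPM, please contact me. I am by no means an RPM expert, so I would be glad to have you join the project.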

2011-03-05

Thoughts on displaying lists and dictionaries that contain Japanese in Python

python python3

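A friend from university has started learning Python. While debugging, he googled a question he had and apparently found the following solution.

#python I wonder whether URL is the canonical way to pretty-print lists and dicts that contain Japanese so that humans can read them.

2011-03-04 21:50:02 via web

When you write code in Python, I think the topic in the title is something that disappoints you at least once. Specifically:
# The output of both lists and dicts is far from readable..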
>>> print ['あ', 'い', 'う']
['\xe3\x81\x82', '\xe3\x81\x84', '\xe3\x81\x86']
>>> print {'title':'ねじまき鳥', 'author':'村上春樹'}
{'author': '\xe6\x9d\x91\xe4\xb8\x8a\xe6\x98\xa5\xe6\xa8\xb9', 'title': '\xe3\x81\xad\xe3\x81\x98\xe3\x81\xbe\xe3\x81\x8d\xe9\xb3\xa5'}
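The Japanese text is shown as escaped bytes, and all elements are printed on a single line. This is discouraging even for print debugging. When you print a complex dictionary, you are guaranteed to spend a lot of effort reformatting the output.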
Pythonで日本語を含んだリストと辞書をpretty printしたい件 | taichino.com

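In Python 2.x, the string representation of an object, that is, what you get when a list or dictionary is converted to str, escapes any non-ASCII characters it contains, which produces the output above. If you really want the Japanese to be readable, you can do the following, although it is a bit tedious.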

$ python2.6
Python 2.6.6 (r266:84292, Mar  3 2011, 19:50:05) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> print ", ".join(['あ', 'い', 'う'])
あ, い, う
>>> print "\n".join("%s: %s" % i for i in {'title':'ねじまき鳥', 'author':'村上春樹'}.items())
author: 村上春樹
title: ねじまき鳥

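Each element is converted to a string object before being passed to the print statement, which is why the Japanese is displayed in readable form.

For a long time I spent frustrating days wondering whether something could be done about this, and when I tried the following code just now without much hope, pretty printing worked. The data is first converted to a JSON string and then re-evaluated with eval as a unicode string. That creates unicode objects from the escaped unicode characters. It does feel forced, but needs must!
Pythonで日本語を含んだリストと辞書をpretty printしたい件 | taichino.com

The original article also brings in unicode strings, which adds to the complexity, but if all you want is readable Japanese in the interactive interpreter, unicode strings are not essential. First, look at the following output.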

>>> 'あ'
'\xe3\x81\x82'
>>> print 'あ'
あ

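The str 'あ' is displayed differently when it is entered at the interpreter prompt and when it is passed to the print statement. At the prompt the interpreter shows the repr of the value, which escapes non-ASCII bytes. The print statement writes to sys.stdout by default, and 'あ' is shown, I believe, because the bytes of the str are written as they are and the terminal interprets them using the character encoding set in the LANG environment variable (I am not sure about this part, so please correct me if I am wrong).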

>>> import sys
>>> sys.stdout.write('あ\n')
あ

Now let's also consider printing a unicode string.

>>> unicode('あ', 'utf-8')
u'\u3042'
>>> print unicode('あ', 'utf-8')
あ

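As expected, 'あ' is displayed.

Pitfall: the print statement
Earlier, to make sure that Japanese was output correctly, we wrote
print u1.encode('utf_8')
but in fact, even if you simply pass a unicode object directly, as in
print u1
it is encoded from unicode to str. In most cases the Japanese is output correctly.
In this case the encoding used is not the one returned by sys.getdefaultencoding() mentioned earlier but the one configured by the locale, for example through the LANG environment variable.
The encoding used here can be checked through sys.stdout.encoding.
PythonのUnicodeEncodeErrorを知る - HDEラボ

According to this, a unicode string is encoded as part of executing the print statement and then output. Internally it probably does something equivalent to the following.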

>>> sys.stdout.encoding
'UTF-8'
>>> sys.stdout.write(unicode('あ\n', 'utf-8').encode(sys.stdout.encoding))
あ

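That was a long preamble, but a unicode string cannot be written to standard output as it is, so in the end it is encoded with the character encoding obtained from the LANG environment variable (UTF-8 in my environment) *1. Therefore, when displaying with the print statement, using a (UTF-8 encoded) str as it is and decoding it to a unicode string first give the same result.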
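In Python3 the same distinction is explicit in the types: str holds text and bytes holds encoded data. The following is a minimal Python3 sketch of the encode and decode round trip discussed above, assuming UTF-8 as the target encoding.

```python
text = 'あ'

# Encoding turns text into the byte sequence that is written to the terminal.
encoded = text.encode('utf-8')
print(encoded)

# Decoding with the same encoding restores the original text.
decoded = encoded.decode('utf-8')
```

The bytes printed here are the same three bytes that Python 2.x shows for the str 'あ' at the interpreter prompt.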

In Python 2.6 and later, str.format is the recommended approach, so you can also write it as follows.

>>> print "{0}, {1}, {2}".format(*['あ', 'い', 'う'])
あ, い, う
>>> print "title: {title}, author: {author}".format(**{'title':'ねじまき鳥', 'author':'村上春樹'})
title: ねじまき鳥, author: 村上春樹

Incidentally, in Python3, thanks to PEP 3138 by @atsuoishimoto, non-ASCII characters in the string representation of lists and dictionaries are displayed without escaping. See Python 3 のオブジェクト文字列表現 - O'Reilly Japan Community Blog and String representation in py3k for details.

$ python3.2
Python 3.2 (r32:88445, Mar  5 2011, 00:53:25) 
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> print(['あ', 'い', 'う'])
['あ', 'い', 'う']
>>> print({'title':'ねじまき鳥', 'author':'村上春樹'})
{'author': '村上春樹', 'title': 'ねじまき鳥'}
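The difference can be seen side by side with the built-in ascii(), which PEP 3138 added to Python3 for producing the escaped form. A minimal sketch:

```python
items = ['あ', 'い', 'う']

# repr() keeps printable non-ASCII characters as they are in Python3.
readable = repr(items)

# ascii() escapes every non-ASCII character, like repr() did in Python 2.x.
escaped = ascii(items)
print(escaped)
```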

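As people often say, moving to Python3 takes care of it... (^ ^;;

I use this fairly often myself, so I turned it into a prettyprint module and registered it on pypi. Install it with easy_install. The source is on github.
Pythonで日本語を含んだリストと辞書をpretty printしたい件 | taichino.com

Finally, with all of this in mind, use pprint, the standard library module that provides pretty-printing.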
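As a minimal Python3 sketch of that advice: pprint formats containers using repr, so in Python3 the Japanese text stays readable and long structures are split across lines. The `book` dictionary and the `width` value are made up for illustration.

```python
import pprint

book = {'title': 'ねじまき鳥', 'author': '村上春樹', 'tags': ['小説', '長編']}

# pformat returns the formatted text; pprint.pprint(book, width=30) would
# write the same text to standard output.
text = pprint.pformat(book, width=30)
lines = text.splitlines()
```

Because the whole dictionary does not fit in 30 columns, each key ends up on its own line, sorted by key.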

*1: My environments are Mac OS X and Linux, but Windows does not output unicode strings directly either, does it?

2009-08-15

Sensible Python programming as seen in optimizing list-to-dictionary conversion

python python3 profile

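Python クックブック第2版 (Python Cookbook, 2nd Edition) contains the recipe "4.12 Building a dict from a list of alternating keys and values".
Original source: Dicts from lists « Python recipes « ActiveState Code

As a way to build a dictionary from a list, it presents a concise approach that combines the zip() and dict() functions.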

def DictFromList(myList):
    return dict(zip(myList[:-1:2], myList[1::2]))

I was impressed by how concise this is, and then found that the comments (and the Cookbook) also present a generator-based approach that is both faster and more general.

def pairwise(iterable):
    itnext = iter(iterable).next
    while 1:
        yield itnext(), itnext()

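Because it works on any iterable, it is more general, and because it uses a generator, it also uses less memory when converting a huge list. According to the Cookbook's discussion, this code contains one more trick. I quote it below.

pairwise has a slightly interesting implementation. Its first statement calls the built-in function iter on the argument iterable and binds the bound method next of the resulting iterator to the local name itnext. This may look a little odd, but it is a general and good technique in Python. If you have an object and all you want to do with it is call one of its bound methods in a loop, extract that method, assign it to a local name, and call the local name as if it were a function.
Amazon.co.jp： Python クックブック第2版: Alex Martelli, Anna Martelli Ravenscroft, David Ascher, 鴨澤眞夫, 當山仁健, 吉田聡, 吉宗貞紀: 本

The book says that the following code behaves the same as the code above but runs about 60% slower.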

def pairwise_slow(iterable):
    it = iter(iterable)
    while 1:
        yield it.next(), it.next()

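I quote the Cookbook once more.

Focusing on simplicity and clarity is good, indeed wonderful - it is in fact a core principle of Python. But throwing away performance for nothing in return is a completely different matter, and it is not a habit to be recommended in any language. In other words, it is great to concentrate on writing correct, clear, simple code, but it is also very wise to learn and use the Python idioms best suited to your needs.
Amazon.co.jp： Python クックブック第2版: Alex Martelli, Anna Martelli Ravenscroft, David Ascher, 鴨澤眞夫, 當山仁健, 吉田聡, 吉宗貞紀: 本

It feels good to come across a good recipe (^ ^;;
I profiled each approach in my environment (Mac OS X 10.5.8, 2GHz Core2Duo, 2GB 667MHz DDR2) with Python2.6 and Python3.1 for lists of 10,000, 1,000,000, and 10,000,000 elements. The results show that the difference in speed grows with the number of elements.
Comparing Python2.6 and Python3.1 for 10,000,000 elements also shows that dict_from_list improves substantially in Python3.1 (57.260 seconds versus 8.648 seconds).

Results.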

$ python2.6 profile_time_convert_dict26.py 
Running dict_from_list which has 10000 items took 0.010 seconds
Running dict_from_sequence which has 10000 items took 0.000 seconds
Running dict_from_list which has 1000000 items took 0.960 seconds
Running dict_from_sequence which has 1000000 items took 0.350 seconds
Running dict_from_list which has 10000000 items took 57.260 seconds
Running dict_from_sequence which has 10000000 items took 3.390 seconds

$ python3.1 profile_time_convert_dict31.py 
Running dict_from_list which has 10000 items took 0.004 seconds
Running dict_from_sequence which has 10000 items took 0.002 seconds
Running dict_from_list which has 1000000 items took 0.742 seconds
Running dict_from_sequence which has 1000000 items took 0.317 seconds
Running dict_from_list which has 10000000 items took 8.648 seconds
Running dict_from_sequence which has 10000000 items took 3.478 seconds

Source code for python2.x.

#!/bin/env python

import time
import random

item_num = [10000, 1000000, 10000000]

def main():
    get_profile(item_num, dict_from_list, dict_from_sequence)

def get_profile(num_list, *funcs):
    totals = {}
    for num in num_list:
        kav_list = [ i for i in range(num) ]
        for func in funcs:
            totals[func] = 0.0
            starttime = time.clock()
            apply(func, [kav_list])
            stoptime = time.clock()
            elapsed = stoptime - starttime
            totals[func] = totals[func] + elapsed
             
            print "Running %s which has %d items took %.3f seconds" % (
                func.__name__, num, totals[func])

def dict_from_list(key_and_values):
    return dict(zip(key_and_values[::2], key_and_values[1::2]))

def dict_from_sequence(seq):
    def pairwise(iterable):
        itnext = iter(iterable).next
        while True:
            yield itnext(), itnext()
    return dict(pairwise(seq))

if __name__ == '__main__':
    main()

Source code for python3.

#!/bin/env python

import time
import random

item_num = [10000, 1000000, 10000000]

def main():
    get_profile(item_num, dict_from_list, dict_from_sequence)

def get_profile(num_list, *funcs):
    totals = {}
    for num in num_list:
        kav_list = [ i for i in range(num) ]
        for func in funcs:
            totals[func] = 0.0
            starttime = time.clock()
            func(*[kav_list])
            stoptime = time.clock()
            elapsed = stoptime - starttime
            totals[func] = totals[func] + elapsed
             
            print("Running %s which has %d items took %.3f seconds" % (
                func.__name__, num, totals[func]))

def dict_from_list(key_and_values):
    return dict(list(zip(key_and_values[::2], key_and_values[1::2])))

def dict_from_sequence(seq):
    def pairwise(iterable):
        itnext = iter(iterable).__next__
        while True:
            yield itnext(), itnext()
    return dict(pairwise(seq))

if __name__ == '__main__':
    main()
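A note for current interpreters: since PEP 479 became the default behaviour in Python 3.7, a StopIteration raised inside a generator body is converted to RuntimeError, so the `while True: yield itnext(), itnext()` form used above no longer ends cleanly when the input is exhausted. The following is a minimal sketch of an equivalent pairing that avoids the problem by letting zip() consume the same iterator twice per step. It is an alternative idiom, not the Cookbook's code, and it was not part of the measurements above.

```python
def pairwise(iterable):
    it = iter(iterable)
    # zip() takes two consecutive items from the same iterator per step
    # and stops cleanly when the iterator is exhausted.
    return zip(it, it)

def dict_from_sequence(seq):
    return dict(pairwise(seq))

result = dict_from_sequence([1, 'a', 2, 'b', 3, 'c'])
print(result)
```

A trailing item without a partner is silently dropped in this sketch.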

Diff produced by 2to3.

$ 2to3 -w -n profile_time_convert_dict31.py
RefactoringTool: Skipping implicit fixer: buffer
RefactoringTool: Skipping implicit fixer: idioms
RefactoringTool: Skipping implicit fixer: set_literal
RefactoringTool: Skipping implicit fixer: ws_comma
 --- profile_time_convert_dict31.py (original)
 +++ profile_time_convert_dict31.py (refactored)
 @@ -15,20 +15,20 @@ 
          for func in funcs:
              totals[func] = 0.0
              starttime = time.clock()
 -            apply(func, [kav_list])
 +            func(*[kav_list])
              stoptime = time.clock()
              elapsed = stoptime - starttime
              totals[func] = totals[func] + elapsed
               
 -            print "Running %s which has %d items took %.3f seconds" % (
 -                func.__name__, num, totals[func])
 +            print("Running %s which has %d items took %.3f seconds" % (
 +                func.__name__, num, totals[func]))
 
  def dict_from_list(key_and_values):
 -    return dict(zip(key_and_values[::2], key_and_values[1::2]))
 +    return dict(list(zip(key_and_values[::2], key_and_values[1::2])))
  
  def dict_from_sequence(seq):
      def pairwise(iterable):
 -        itnext = iter(iterable).next
 +        itnext = iter(iterable).__next__
          while True:
              yield itnext(), itnext()
      return dict(pairwise(seq))
 RefactoringTool: Files that were modified:
 RefactoringTool: profile_time_convert_dict31.py

Reference:
Optimizing loops in Python - forest book

Python クックブック第2版

Authors: Alex Martelli, Anna Martelli Ravenscroft, David Ascher, 鴨澤眞夫, 當山仁健, 吉田聡, 吉宗貞紀
Publisher: オライリー・ジャパン
Release date: 2007/06/26
Format: large-format book

2009-08-07

Optimizing loops in Python

python python3 profile

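Original source: this page has been deleted

It would be perfect if this helped people who want to say "Python really is the way to go".
さくらのブログ

This is a wonderful result (^ ^;;
However, in "random number list creation" Python is slightly behind Ruby.
The source code for that part is shown below.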

    list = []
    start = time.time()
    for i in range( cnt ):
        list.append( random.randint(0, 2147483647) )
    print( time.time() - start )

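Python program source
If the point is simply to compare appending values to a list one by one, this is the right way to do it. However, Python also has a wonderful feature called list comprehensions.

According to Learning Python 3rd Edition, Generator Expression Iterators Meet List Comprehensions, list comprehensions are faster than for loops because they are better optimized, although this depends on the nature of the loop.

If possible, I would like to see the comparison redone with the source code above changed as follows.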

    start = time.time()
    list = [ random.randint(0, 2147483647) for i in range(cnt) ]
    print( time.time() - start )

Below are the profiling results in my environment (Mac OS X 10.5.7, 2GHz Core2Duo, 2GB 667MHz DDR2). They show that loop processing itself is also faster in python3.

python 2.6.2
$ python profile_list_operation_26.py 100000
it took 5.347 seconds to run for_statement 10 times 
it took 5.278 seconds to run list_comprehension 10 times 
it took 5.643 seconds to run map_function 10 times 
it took 5.319 seconds to run generator_expression 10 times 

python 3.1
$ python3.1 profile_list_operation_31.py 100000
it took 4.314 seconds to run for_statement 10 times 
it took 4.077 seconds to run list_comprehension 10 times 
it took 4.456 seconds to run map_function 10 times 
it took 4.151 seconds to run generator_expression 10 times
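For anyone repeating the comparison today: the harness in this post relies on time.clock(), which was removed in Python 3.8. The following is a minimal sketch of the same kind of measurement with the standard timeit module. The element count and repetition count are arbitrary choices, and the timings it prints will differ from the figures above.

```python
import random
import timeit

RAND_MAX = 2147483647

def for_statement(count):
    values = []
    for _ in range(count):
        values.append(random.randint(0, RAND_MAX))
    return values

def list_comprehension(count):
    return [random.randint(0, RAND_MAX) for _ in range(count)]

for func in (for_statement, list_comprehension):
    # timeit calls the callable `number` times and returns the total seconds.
    seconds = timeit.timeit(lambda: func(1000), number=10)
    print("it took %.3f seconds to run %s 10 times" % (seconds, func.__name__))
```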

Source code for python2.x.

#!/bin/env python

import sys
import time
import random

rand_max = 2147483647

def main():
    get_profile(10, int(sys.argv[1]), for_statement, 
        list_comprehension, map_function, generator_expression)

def get_profile(pro_times, max, *funcs):
    totals = {}
    for func in funcs:
        totals[func] = 0.0
        starttime = time.clock()
        for x in range(pro_times):
            apply(func, [max])
        stoptime = time.clock()
        elapsed = stoptime - starttime
        totals[func] = totals[func] + elapsed
    for func in funcs:
        print "it took %.3f seconds to run %s %d times " % (
                totals[func], func.__name__, pro_times)

def for_statement(max):
    l = []
    for i in range(max):
        l.append(random.randint(0, rand_max))

def list_comprehension(max):
    l = [ random.randint(0, rand_max) for i in range(max) ]

def map_function(max):
    l = map((lambda x: random.randint(0, rand_max)), range(max))

def generator_expression(max):
    l = list(random.randint(0, rand_max) for x in range(max))

if __name__ == '__main__':
    main()

Source code for python3.

#!/bin/env python

import sys
import time
import random

rand_max = 2147483647

def main():
    get_profile(10, int(sys.argv[1]), for_statement, 
        list_comprehension, map_function, generator_expression)

def get_profile(pro_times, max, *funcs):
    totals = {}
    for func in funcs:
        totals[func] = 0.0
        starttime = time.clock()
        for x in range(pro_times):
            func(*[max])
        stoptime = time.clock()
        elapsed = stoptime - starttime
        totals[func] = totals[func] + elapsed
    for func in funcs:
        print("it took %.3f seconds to run %s %d times " % (
                totals[func], func.__name__, pro_times))

def for_statement(max):
    l = []
    for i in range(max):
        l.append(random.randint(0, rand_max))

def list_comprehension(max):
    l = [ random.randint(0, rand_max) for i in range(max) ]

def map_function(max):
    l = list(map((lambda x: random.randint(0, rand_max)), list(range(max))))

def generator_expression(max):
    l = list(random.randint(0, rand_max) for x in range(max))

if __name__ == '__main__':
    main()

Diff produced by 2to3.

$ 2to3 -w -n profile_list_operation_31.py 
RefactoringTool: Skipping implicit fixer: buffer
RefactoringTool: Skipping implicit fixer: idioms
RefactoringTool: Skipping implicit fixer: set_literal
RefactoringTool: Skipping implicit fixer: ws_comma
--- profile_list_operation_26.py (original)
+++ profile_list_operation_31.py (refactored)
@@ -16,13 +16,13 @@
         totals[func] = 0.0
         starttime = time.clock()
         for x in range(pro_times):
-            apply(func, [max])
+            func(*[max])
         stoptime = time.clock()
         elapsed = stoptime - starttime
         totals[func] = totals[func] + elapsed
     for func in funcs:
-        print "it took %.3f seconds to run %s %d times " % (
-                totals[func], func.__name__, pro_times)
+        print("it took %.3f seconds to run %s %d times " % (
+                totals[func], func.__name__, pro_times))
 
 def for_statement(max):
     l = []
@@ -33,7 +33,7 @@
     l = [ random.randint(0, rand_max) for i in range(max) ]
 
 def map_function(max):
-    l = map((lambda x: random.randint(0, rand_max)), range(max))
+    l = list(map((lambda x: random.randint(0, rand_max)), list(range(max))))
 
 def generator_expression(max):
     l = list(random.randint(0, rand_max) for x in range(max))
RefactoringTool: Files that were modified:
RefactoringTool: profile_list_operation_31.py

2009-05-19

Building a slideshow that displays text data with python3: 1

python3

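Name: Slideshow that displays text data on the console, version 1
Purpose: Develop a tool for running coding demos smoothly during presentations
Features: Reads a file and writes it to standard output
Usage example: none in particular
Notes: Version 1 is something like the cat command

I decided to write in python the slideshow tool that id:nobsun explained at the JUS Haskell study group. I found it interesting that the tool is developed by adding features step by step instead of building the final form all at once, and I will follow the same style. It can probably be written with standard modules alone, so I will take on the challenge in python3.
In practice, I just write the code as usual and convert it with 2to3 afterwards (^ ^;;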

#!/usr/bin/env python
"""
SlideShow 01 with Python3
"""

import sys

#########################################################################
# Exit Status Value
#########################################################################
status = [
    'normal'
  , 'invalid_args'
  , 'open_file'
]

#########################################################################
# Script Main
#########################################################################
def main():
    try:
        check_option()
        f = open_file(sys.argv[1], 'r', True)
        for line in f:
            print(line, end='')
        f.close()
    except OptionError as err:
        sys.exit(status.index(err.status))
    except EnvError as err:
        sys.exit(status.index(err.status))
   
    sys.exit(status.index('normal'))

#########################################################################
# Functions
#########################################################################
def check_option():
    if len(sys.argv) < 2:
        raise OptionErrorUsage('invalid arguments')

def open_file(file, mode, do_exit=None):
    try:
        f = open(file, mode)
    except IOError as err:
        f = Null()
        if do_exit:
            raise EnvErrorOpenFile(err)
    return f

#########################################################################
# Class Definition
#########################################################################
class Null(object):
    def __new__(cls, *args, **kwargs):  # Singleton for 1 instance
        if '_inst' not in vars(cls):
            cls._inst = super(Null, cls).__new__(cls)  # object.__new__ takes no extra arguments
        return cls._inst   
    def __init__(self, *args, **kwargs): pass
    def __call__(self, *args, **kwargs): return self
    def __repr__(self): return 'Null()'
    def __iter__(self): return iter(())
    def __bool__(self): return False
    def __getattr__(self, name): return self
    def __setattr__(self, name, value): return self
    def __delattr__(self, name): return self

#########################################################################
# Exceptions
#########################################################################
class MyError(Exception):
    """Base class for all exceptions"""
    def __init__(self, msg, value=None):
        self.status = value
        if msg:
            sys.stderr.write('%s\n' % (msg))

class OptionError(MyError): pass
class OptionErrorUsage(OptionError):
    def __init__(self, msg):
        OptionError.__init__(self, msg, 'invalid_args')
        self.usage()
    def usage(self):
        print("Usage: %s filename" % (sys.argv[0]))

class EnvError(MyError): pass
class EnvErrorOpenFile(EnvError):
    def __init__(self, err):
        msg = 'Cannot open: %s' % (err.filename)
        EnvError.__init__(self, msg, 'open_file')


if __name__ == '__main__':
    main()
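The `Null` class above follows the Null Object pattern: when `open_file` cannot open a file and `do_exit` is false, the caller still receives an object it can iterate over and close without special-casing. The following is a standalone sketch of that idea, written under two assumptions that differ from the listing: `__setattr__` takes the `(name, value)` pair that Python passes for attribute assignment, and `__new__` does not forward extra arguments to `object.__new__`.

```python
class Null:
    """Null Object sketch: absorbs calls, attribute access, and iteration."""
    _inst = None  # the single shared instance

    def __new__(cls, *args, **kwargs):
        if cls._inst is None:
            # object.__new__ rejects extra arguments, so pass only cls
            cls._inst = super().__new__(cls)
        return cls._inst

    def __init__(self, *args, **kwargs): pass
    def __call__(self, *args, **kwargs): return self
    def __repr__(self): return 'Null()'
    def __iter__(self): return iter(())      # iterating yields nothing
    def __bool__(self): return False         # falsy in conditions
    def __getattr__(self, name): return self  # only called for missing attributes
    def __setattr__(self, name, value): pass  # attribute assignment is ignored
    def __delattr__(self, name): pass         # attribute deletion is ignored


if __name__ == '__main__':
    f = Null()
    for line in f:      # the loop body never runs
        print(line, end='')
    f.close()           # attribute lookup and the call both return Null()
    print(repr(f), bool(f))
```

Because every operation returns the same instance, chains such as `f.buffer.raw.close()` also succeed silently. That convenience is the point of the pattern, and it is also why it can hide real failures if it is used carelessly.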

Execution results:

$ python3.0 SlideShow01_30.py t.txt 
a
bb
ccc
dddd
eeeee

$ python3.0 SlideShow01_30.py
invalid arguments
Usage: SlideShow01_30.py filename

$ python3.0 SlideShow01_30.py detarame
Cannot open: detarame
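The three runs above correspond to the three entries of the `status` list, whose index becomes the process exit code via `status.index(...)`. The following is a condensed sketch of that mapping, not the script itself: the `run` function and its `out`/`err` parameters are hypothetical names introduced so the behavior can be checked without spawning a process, and a `with` block stands in for the explicit `f.close()`.

```python
import sys

STATUS = ['normal', 'invalid_args', 'open_file']  # same order as the script's status list


def run(argv, out=sys.stdout, err=sys.stderr):
    """Return the exit code that the script would pass to sys.exit()."""
    if len(argv) < 2:
        err.write('invalid arguments\n')
        err.write('Usage: %s filename\n' % argv[0])
        return STATUS.index('invalid_args')   # 1
    try:
        with open(argv[1], 'r') as f:          # closed automatically, even on error
            for line in f:
                out.write(line)                # each line keeps its own newline
    except IOError as e:
        err.write('Cannot open: %s\n' % e.filename)
        return STATUS.index('open_file')       # 2
    return STATUS.index('normal')              # 0


if __name__ == '__main__':
    sys.exit(run(sys.argv))
```

Running this from a shell and inspecting `$?` afterwards should show 0, 1, or 2 for the three cases respectively.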

The diff output from 2to3:

--- SlideShow01_26.py (original)
+++ SlideShow01_26.py (refactored)
@@ -22,11 +22,11 @@
         check_option()
         f = open_file(sys.argv[1], 'r', True)
         for line in f:
-            print line,
+            print(line, end='')
         f.close()
-    except OptionError, err:
+    except OptionError as err:
         sys.exit(status.index(err.status))
-    except EnvError, err:
+    except EnvError as err:
         sys.exit(status.index(err.status))
    
     sys.exit(status.index('normal'))
@@ -36,15 +36,15 @@
 #########################################################################
 def check_option():
     if len(sys.argv) < 2:
-        raise OptionErrorUsage, 'invalid arguments'
+        raise OptionErrorUsage('invalid arguments')
 
 def open_file(file, mode, do_exit=None):
     try:
         f = open(file, mode)
-    except IOError, err:
+    except IOError as err:
         f = Null()
         if do_exit:
-            raise EnvErrorOpenFile, err
+            raise EnvErrorOpenFile(err)
     return f
 
 #########################################################################
@@ -59,7 +59,7 @@
     def __call__(self, *args, **kwargs): return self
     def __repr__(self): return 'Null()'
     def __iter__(self): return iter(())
-    def __nonzero__(self): return False
+    def __bool__(self): return False
     def __getattr__(self, name): return self
     def __setattr__(self, name): return self
     def __delattr__(self, name): return self
@@ -80,7 +80,7 @@
         OptionError.__init__(self, msg, 'invalid_args')
         self.usage()
     def usage(self):
-        print "Usage: %s filename" % (sys.argv[0])
+        print("Usage: %s filename" % (sys.argv[0]))
 
 class EnvError(MyError): pass
 class EnvErrorOpenFile(EnvError):
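The diff above applies four mechanical rewrites: the `print` statement becomes a function call, `except E, err` becomes `except E as err`, `raise E, value` becomes `raise E(value)`, and `__nonzero__` is renamed to `__bool__`. The following is a small standalone sketch of the Python 3 forms, with the Python 2 spelling noted in comments. The `Quiet`, `parse`, and `run` names are hypothetical and are not part of the script.

```python
import sys


class Quiet:
    def __bool__(self):            # Python 2 spelled this __nonzero__
        return False


def parse(argv):
    if len(argv) < 2:
        # Python 2: raise ValueError, 'invalid arguments'
        raise ValueError('invalid arguments')
    return argv[1]


def run(argv):
    try:
        return parse(argv)
    except ValueError as err:      # Python 2: except ValueError, err
        # Python 2: print >>sys.stderr, err
        print(err, file=sys.stderr)
        return None


if __name__ == '__main__':
    # Python 2: print "result:", run(sys.argv),
    print("result:", run(sys.argv), end='')
```

A trailing comma on a Python 2 `print` statement suppresses the newline, which is why 2to3 translates `print line,` into `print(line, end='')`.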