pandas.read_sql的误导性参数chunksize

2017-12-07

官方文档的描述是该参数返回一个迭代的对象，该对象包含很多个chunksize大小的块。容易误导人的是设置chunksize之后，从数据获取数据就不会一次返回所有的数据，而是分块的返回。

更烦的是这个情况还不容易察觉。当使用sqlalchemy+pymssql连接MSSQL导出一张100W行数据的表，chunksize给人的表现看起来就是每次只获取部分，内存并没有太大变化，而当使用sqlalchemy+mysql连接MySQL时，就会发现内存疯涨，究其原因发现正是因为mysql获取了从MySQL获取所有数据到内存之后组装成结果集然后游标对该结果集进行操作。

网上有很多人遇到过同样的问题：

解决办法就是使用sqlalchemy的执行命令时设置execution_options且stream_results参数为True, 或者使用服务器游标:create_engine()时设置参数server_side_cursors=True .

stream_results – Available on: Connection, statement. Indicate to the dialect that results should be “streamed” and not pre-buffered, if possible. This is a limitation of many DBAPIs. The flag is currently understood only by the psycopg2, mysqldb and pymysql dialects.

Server-side cursor support is available for the MySQLdb and PyMySQL dialects. From a MySQL point of view this means that the MySQLdb.cursors.SSCursor orpymysql.cursors.SSCursor class is used when building up the cursor which will receive results. The most typical way of invoking this feature is via theConnection.execution_options.stream_results connection execution option. Server side cursors can also be enabled for all SELECT statements unconditionally by passing server_side_cursors=True to create_engine().

了解MySQL服务器游标可以看我的另一篇文章：MySQL服务器端游标详解．

Live and learn!

数据库数据迁移

pandas.read_sql的误导性参数chunksize