Wednesday, January 23, 2008

Minor Python Oddity: Rename and Copy in Different Modules?

As part of my learn-as-I-go strategy to picking up new languages, I tried a Python script for a task I've been frequently doing.

I'd written a short Python script last week that did something more aggressive and was pleased with how compact it was (once I rewrote my loops with list comprehensions), as well as snappy fast.

This script was dirt simple, it needs to ensure a a file is copied/moved to two other locations (the original file should. Normally I would have used a Bourne shell script:

src="/path/foo"
dest1="/path2"
dest2="/path3"
cp $src $dest1
mv $src $dest2
Maybe a slight improvement to check if it's needed:
src="/path/foo"
dest1="/path2"
dest2="/path3"
if [ -r $src ] ; then
cp $src $dest1
mv $src $dest2
fi
Anyway, so the result looks pretty similar:
import sys
import os
import shutil

src="/path/foo"
dest1="/path2"
dest2="/path3"

if os.access(src, os.R_OK):
shutil.copy(src, dest1)
os.renames(src, dest2)
It seemed odd that copy and move are in different modules: the os module has rename, but in order to get copy I need the shutil module. I was annoyed because it took me a while to locate the copy documentation. Maybe I'm just rationalizing my annoyance, but aren't these modules are little overzealously partitioned? And isn't shutil just a really silly name?

What is the logic to this that puts file stat, list directory, change directory, make directory, remove directory, file name, create symlink, and just about every other straightforward directory/file operation in one module, but copying in another?

Ah well, I recognize this as just a bump in learning a different library system. I wonder if the reason I'm annoyed is from my ambiguity intolerance: even though the difference is minor practically (I can easily learn to import shutil) I don't understand the reason behind the structure, and that bothers me. Ambiguity tolerance is an interesting phenomenon and I was taught in (human) linguistics courses that a lack of tolerance for ambiguity can be a hindrance to language acquisition. I seriously consider the same may hold true for adapting new programming languages and tools: aside from other considerations, if people can't understand and navigate the structure and logic behind your system, and they have a measure of ambiguity intolerance, they may quickly brush it off. Just a thought.

5 comments:

Anonymous said...

I think it comes down to levels of native support by the O/S.

Moving files, listing, stat, etc. are all very "lightweight" operations (at least on Unix – I can't tell you about elsewhere) provided natively by the O/S, and the python os module is just providing a "thin wrapper" around those services – just single system calls, essentially. Moving/renaming is lightweight because it just involves altering directories, not files, where directories can be thought of, pythonically, as dictionaries mapping file names to inode numbers. Unless moving across filesystem boundaries (see below) you don't actually touch the file's data. Under Unix, moving a 10GB file can be exactly as quick as moving a 1KB file. :-)

Copying, on the other hand, is more work and ordinarily implemented in a library above the level of the O/S: you ask the O/S to open the source file for reading and return a handle to it, then to create a new empty file for writing and return a handle on it, then you read the bytes from the source and write them to the destination until you're finished, then you close both files. and give you a handle on it. That's a whole lot more system calls, and is likely to take a lot more time and have a lot more failure modes. It's decidely not a thin wrapper around an O/S system call (at least, on Unix).

This is, anyway, why I've always thought shutil was separated out in that way.

(Of course, moving isn't lightweight across filesystem boundaries because you really do have to move the data, ie copy it to the new place and delete it from the old - hence the caveat in the python docs that "the operation may fail on some Unix flavors if src and dst are on different filesystems".)

JFKBits said...

gimboland: Perfect, this makes sense. Thanks for your thoughtful and complete reply! When I saw the mknod and similar low-level functions (thin OS call wrappers, as you point out) it did move me in that direction, but I never noticed they all seemed to be of that nature, the thin OS call.

I still think shutil ("shuttle"? "shootle"? "shh--util"?) is a silly name, but it at least makes sense.

Anonymous said...

More a question than a comment: I'm writing a script to back things up into a tarball of about 26mb and copy it to a flash drive. If I use shutil.copy it takes like 54 seconds; if I use os.system to call cp it takes like 8 seconds. (I'm doing this on a Mac; if you use Finder to copy that tarball it takes like 8 seconds.) If I use os.system to call mv, it also takes about 54 seconds.

helium said...

To me shutil means Sh(ell) utils, because it offers the missing commands you have in most shells. But I'm not sure.

ChadN said...

To Anonymous above who commented on the difference in file copy times using shutil.copy() vs. system copy command: shutil.copyfile() is implemented portably (and trivially) in python, using a read/write loop on open files. The system copy can be optimized quite a bit more (including not having to copy file data from system to user space, and back).

If you want more speed, but less portability, you can use system level copies (such as os.system("cp blah blah2") on Unix, or win32file.CopyFile() on windows).