Python practical tricks
Processing Files
Copy files
In Python, you can copy files using the `shutil`, `os`, or `subprocess` module.
import os
import shutil
import subprocess
Copying files using the `shutil` module
shutil.copyfile
signature
shutil.copyfile(src_file, dest_file, *, follow_symlinks=True)
# example
shutil.copyfile('source.txt', 'destination.txt')
shutil.copy
signature
shutil.copy(src_file, dest_file, *, follow_symlinks=True)
# example
shutil.copy('source.txt', 'destination.txt')
shutil.copy2
signature
shutil.copy2(src_file, dest_file, *, follow_symlinks=True)
# example
shutil.copy2('source.txt', 'destination.txt')
shutil.copyfileobj
signature
shutil.copyfileobj(src_file_object, dest_file_object[, length])
# example
file_src = 'source.txt'
file_dest = 'destination.txt'
# open the source for binary reading and the destination for binary writing;
# the with statement closes both file objects automatically
with open(file_src, 'rb') as f_src, open(file_dest, 'wb') as f_dest:
    shutil.copyfileobj(f_src, f_dest)
Clarification on the `shutil` module
| Function | Copies metadata | Copies permissions | Uses file object | Destination may be directory |
|---|---|---|---|---|
| `shutil.copy` | No | Yes | No | Yes |
| `shutil.copyfile` | No | No | No | No |
| `shutil.copy2` | Yes | Yes | No | Yes |
| `shutil.copyfileobj` | No | No | Yes | No |
Note that even the `shutil.copy2()` function cannot copy all file metadata. – wovano, Mar 11, 2022
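As the table shows, `shutil.copy` (unlike `shutil.copyfile`) accepts a directory as the destination and keeps the original file name. A minimal sketch, assuming a pre-existing `backup/` directory:
import shutil
# 'backup/' is an assumed, already-existing directory
shutil.copy('source.txt', 'backup/')  # creates backup/source.txt
# shutil.copyfile('source.txt', 'backup/') would fail instead,
# since copyfile requires a complete target file name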
Copying files using the `os` module
os.popen
signature
os.popen(cmd, mode='r', buffering=-1)
# example
# In Unix/Linux
os.popen('cp source.txt destination.txt')
# In Windows
os.popen('copy source.txt destination.txt')
os.system
signature
os.system(command)
# In Linux/Unix
os.system('cp source.txt destination.txt')
# In Windows
os.system('copy source.txt destination.txt')
Copying files using the `subprocess` module
subprocess.call
signature
subprocess.call(args, *, stdin=None, stdout=None, stderr=None, shell=False)
# example (WARNING: setting `shell=True` can be a security risk)
# In Linux/Unix
status = subprocess.call('cp source.txt destination.txt', shell=True)
# In Windows
status = subprocess.call('copy source.txt destination.txt', shell=True)
subprocess.check_output
signature
subprocess.check_output(args, *, stdin=None, stderr=None, shell=False, universal_newlines=False)
# example (WARNING: setting `shell=True` can be a security risk)
# check_output returns the command's output and raises CalledProcessError on failure
# In Linux/Unix
output = subprocess.check_output('cp source.txt destination.txt', shell=True)
# In Windows
output = subprocess.check_output('copy source.txt destination.txt', shell=True)
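On Python 3.5+, a safer pattern is to pass the command as a list of arguments and avoid `shell=True` entirely. A minimal sketch using `subprocess.run` (the Unix `cp` variant is shown; on Windows the `copy` builtin would need `cmd /c`):
import subprocess
# a list of arguments avoids shell injection risks;
# check=True raises CalledProcessError on a nonzero exit status
subprocess.run(['cp', 'source.txt', 'destination.txt'], check=True)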
JSON
Load a JSON file
import json
with open('site_occ.json') as file:
parsed_json = json.load(file)
print(parsed_json)
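Conversely, a minimal sketch for writing a Python object back out as JSON (the file name `out.json` and the data are just examples):
import json
data = {'sites': [{'name': 'home', 'occupied': True}]}
with open('out.json', 'w') as file:
    json.dump(data, file, indent=4)  # indent=4 pretty-prints the output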
List, tuple and dict
Remove all occurrences of an element from a list
Functional approach:
Python 3.x
>>> x = [1,2,3,2,2,2,3,4]
>>> list(filter((2).__ne__, x))
[1, 3, 3, 4]
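An equivalent list comprehension, which is often considered more readable:
>>> x = [1,2,3,2,2,2,3,4]
>>> [v for v in x if v != 2]
[1, 3, 3, 4]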
Remove duplicates from a list of lists
Source: stackoverflow.com
>>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
>>> import itertools
>>> k.sort()
>>> list(k for k,_ in itertools.groupby(k))
[[1, 2], [3], [4], [5, 6, 2]]
`itertools` often offers the fastest and most powerful solutions to this kind of problem, and is well worth getting intimately familiar with!
Edit: as I mention in a comment, normal optimization efforts are focused on large inputs (the big-O approach) because it’s so much easier that it offers good returns on efforts. But sometimes (essentially for “tragically crucial bottlenecks” in deep inner loops of code that’s pushing the boundaries of performance limits) one may need to go into much more detail, providing probability distributions, deciding which performance measures to optimize (maybe the upper bound or the 90th centile is more important than an average or median, depending on one’s apps), performing possibly-heuristic checks at the start to pick different algorithms depending on input data characteristics, and so forth.
Careful measurements of “point” performance (code A vs code B for a specific input) are a part of this extremely costly process, and the standard library module `timeit` helps here. However, it's easiest to use at a shell prompt. For example, here's a short module that showcases the general approach to this problem; save it as `nodup.py`:
import itertools

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

def doset(k, map=map, list=list, set=set, tuple=tuple):
    # deduplicate via a set of tuples (lists themselves are not hashable)
    return list(map(list, set(map(tuple, k))))

def dosort(k, sorted=sorted, range=range, len=len):
    # sort, then keep each item that differs from its predecessor
    ks = sorted(k)
    return [ks[i] for i in range(len(ks)) if i == 0 or ks[i] != ks[i-1]]

def dogroupby(k, sorted=sorted, groupby=itertools.groupby):
    # sort, then let groupby collapse runs of equal adjacent items
    ks = sorted(k)
    return [i for i, _ in groupby(ks)]

def donewk(k):
    # quadratic approach: a linear scan of the output list for every item
    newk = []
    for i in k:
        if i not in newk:
            newk.append(i)
    return newk

# sanity check that all functions compute the same result and don't alter k
if __name__ == '__main__':
    savek = list(k)
    for f in doset, dosort, dogroupby, donewk:
        resk = f(k)
        assert k == savek
        print('%10s %s' % (f.__name__, sorted(resk)))
Note the sanity check (performed when you just run `python nodup.py`) and the basic hoisting technique (making constant global names local to each function for speed) to put things on an equal footing.
Now we can run checks on the tiny example list:
$ python -mtimeit -s'import nodup' 'nodup.doset(nodup.k)'
100000 loops, best of 3: 11.7 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dosort(nodup.k)'
100000 loops, best of 3: 9.68 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dogroupby(nodup.k)'
100000 loops, best of 3: 8.74 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.donewk(nodup.k)'
100000 loops, best of 3: 4.44 usec per loop
confirming that the quadratic approach has small-enough constants to make it attractive for tiny lists with few duplicated values. With a short list without duplicates:
$ python -mtimeit -s'import nodup' 'nodup.donewk([[i] for i in range(12)])'
10000 loops, best of 3: 25.4 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dogroupby([[i] for i in range(12)])'
10000 loops, best of 3: 23.7 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.doset([[i] for i in range(12)])'
10000 loops, best of 3: 31.3 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dosort([[i] for i in range(12)])'
10000 loops, best of 3: 25 usec per loop
the quadratic approach isn’t bad, but the sort and groupby ones are better. Etc, etc.
If (as the obsession with performance suggests) this operation is at a core inner loop of your pushing-the-boundaries application, it’s worth trying the same set of tests on other representative input samples, possibly detecting some simple measure that could heuristically let you pick one or the other approach (but the measure must be fast, of course).
It's also well worth considering keeping a different representation for `k` – why does it have to be a list of lists rather than a set of tuples in the first place? If the duplicate removal task is frequent, and profiling shows it to be the program's performance bottleneck, then keeping a set of tuples all the time and building a list of lists from it only if and where needed might be faster overall.
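A minimal sketch of that alternative representation (the added tuple `(7, 8)` is only an illustration):
k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
# keep the data as a set of tuples so duplicates never accumulate
ks = {tuple(inner) for inner in k}
ks.add((7, 8))                      # insertion is cheap and duplicate-free
as_lists = [list(t) for t in ks]    # convert back only where lists are required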
Sort list of dicts
Sort by values
Source: stackoverflow.com
The `sorted()` function takes a `key=` parameter:
newlist = sorted(list_to_be_sorted, key=lambda d: d['name'])
Alternatively, you can use `operator.itemgetter` instead of defining the function yourself:
from operator import itemgetter
newlist = sorted(list_to_be_sorted, key=itemgetter('name'))
For completeness, add `reverse=True` to sort in descending order:
newlist = sorted(list_to_be_sorted, key=itemgetter('name'), reverse=True)
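`itemgetter` also accepts several keys, producing a tuple to sort by each key in turn; a minimal sketch (the `'age'` key is an assumed example):
from operator import itemgetter
# sort by 'name' first, breaking ties by 'age'
newlist = sorted(list_to_be_sorted, key=itemgetter('name', 'age'))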
Sort by key names
Source: stackoverflow.com
Just use `sorted()` with a list like `[key1 in dict, key2 in dict, ...]` as the key to sort by. Remember to reverse the result, since `True` (i.e. the key is in the dict) sorts after `False`.
>>> dicts = [{1:2, 3:4}, {3:4}, {5:6, 7:8}]
>>> keys = [5, 3, 1]
>>> sorted(dicts, key=lambda d: [k in d for k in keys], reverse=True)
[{5: 6, 7: 8}, {1: 2, 3: 4}, {3: 4}]
This uses all the keys to break ties: in the above example, two dicts have the key `3`, but one of them also has the key `1`, so that one is sorted second (ahead of the dict with only key `3`).
Remove a key from a dict
Source: stackoverflow.com
To delete a key regardless of whether it is in the dictionary, use the two-argument form of `dict.pop()`:
my_dict.pop('key', None)
This will return `my_dict['key']` if the key exists in the dictionary, and `None` otherwise. If the second parameter is not specified (i.e. `my_dict.pop('key')`) and the key does not exist, a `KeyError` is raised.
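A quick illustration of both behaviours:
my_dict = {'key': 42}
print(my_dict.pop('key', None))  # 42 -- the key existed and is now removed
print(my_dict.pop('key', None))  # None -- the key is gone, no exception
my_dict.pop('key')               # raises KeyError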
To delete a key that is guaranteed to exist, you can also use
del my_dict['key']
This will raise a `KeyError` if the key is not in the dictionary.
Variables
Create dynamic variable names
Using a `for` loop
Dynamic variable names can be created in Python with the help of iteration. Along with the `for` loop, the `globals()` function is used in this method. `globals()` returns the dictionary implementing the current global symbol table, so writing a new entry into it creates a module-level name. The following code uses a `for` loop and `globals()` to create dynamic variable names in Python.
for n in range(0, 7):
    globals()['strg%s' % n] = 'Hello'
# strg0 = 'Hello', strg1 = 'Hello' ... strg6 = 'Hello'

for x in range(0, 7):
    globals()[f"variable{x}"] = f"Hello from variable number {x}!"

print(variable5)
Output:
Hello from variable number 5!
Using a dictionary
A dictionary is one of the four built-in collection types provided by Python, along with the tuple, list, and set. It stores data as key: value pairs. A dictionary is both ordered (in Python 3.7 and above) and mutable, and it is written with curly brackets `{}`. In addition, dictionaries cannot have duplicate keys.
Since a dictionary maps keys to values, it is easy to emulate a dynamic variable name with one.
The following code uses a dictionary to create a dynamic variable name in Python.
var = "a"
val = 4
dict1 = {var: val}
print(dict1["a"])
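For instance, the dictionary-based equivalent of the `globals()` loop above:
# a dict comprehension stands in for the seven dynamic variables
variables = {f"variable{x}": f"Hello from variable number {x}!" for x in range(7)}
print(variables["variable5"])  # Hello from variable number 5!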
Although creating dynamic variable names is possible in Python, it is usually unnecessary, since data in Python is created dynamically: Python references the objects in the code, and as long as a reference to an object exists, the object itself exists. Creating variables this way is not recommended.