Handling Chinese filenames in Flask file uploads

The symptom looks like this:

from werkzeug.utils import secure_filename
print(secure_filename("一级标题.png"))
# png  (the Chinese characters and the leading dot are gone)

print(secure_filename("kakakkak.png"))
# kakakkak.png  (pure ASCII names are kept as-is)
print(secure_filename("1232233.png"))
# 1232233.png
print(secure_filename("迟迟不吃饭1232233.png"))
# 1232233.png  (only the Chinese characters were removed)

The source code of secure_filename is as follows:

import os, sys, re
text_type = str
PY2 = sys.version_info[0] == 2
WIN = sys.platform.startswith('win')
_format_re = re.compile(r'\$(?:(%s)|\{(%s)\})' % (('[a-zA-Z_][a-zA-Z0-9_]*',) * 2))
_entity_re = re.compile(r'&([^;]+);')
_filename_ascii_strip_re = re.compile(r'[^A-Za-z0-9_.-]')
_windows_device_files = ('CON', 'AUX', 'COM1', 'COM2', 'COM3', 'COM4', 'LPT1',
'LPT2', 'LPT3', 'PRN', 'NUL')


def secure_filename(filename):
    r"""Pass it a filename and it will return a secure version of it. This
    filename can then safely be stored on a regular file system and passed
    to :func:`os.path.join`. The filename returned is an ASCII only string
    for maximum portability.

    On windows systems the function also makes sure that the file is not
    named after one of the special device files.

    >>> secure_filename("My cool movie.mov")
    'My_cool_movie.mov'
    >>> secure_filename("../../../etc/passwd")
    'etc_passwd'
    >>> secure_filename(u'i contain cool \xfcml\xe4uts.txt')
    'i_contain_cool_umlauts.txt'

    The function might return an empty filename. It's your responsibility
    to ensure that the filename is unique and that you generate random
    filename if the function returned an empty one.

    .. versionadded:: 0.5

    :param filename: the filename to secure
    """
    if isinstance(filename, text_type):
        from unicodedata import normalize
        # How do names such as "Marek Čech" or "Beniardá" become their
        # ASCII equivalents? Through the call below.
        filename = normalize('NFKD', filename).encode('ascii', 'ignore')
        if not PY2:
            filename = filename.decode('ascii')
    for sep in os.path.sep, os.path.altsep:
        if sep:
            filename = filename.replace(sep, ' ')
    filename = str(_filename_ascii_strip_re.sub('', '_'.join(
        filename.split()))).strip('._')

    # on nt a couple of special files are present in each folder. We
    # have to ensure that the target file is not such a filename. In
    # this case we prepend an underline
    if os.name == 'nt' and filename and \
       filename.split('.')[0].upper() in _windows_device_files:
        filename = '_' + filename

    return filename
'''
In other words, the function above only keeps the characters `A-Za-z0-9_.-`;
everything else, including every non-ASCII character, is removed.
'''

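The problem lies in the line filename = normalize('NFKD', filename).encode('ascii', 'ignore'), specifically in the encode('ascii', 'ignore') call: every character that cannot be represented in ASCII is silently discarded.

As we know, Python's str.encode method offers several ways of handling encoding errors: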

class str(object):
    def encode(self, encoding='utf-8', errors='strict'):  # real signature unknown; restored from __doc__
        """
        S.encode(encoding='utf-8', errors='strict') -> bytes

        Encode S using the codec registered for encoding. Default encoding
        is 'utf-8'. errors may be given to set a different error
        handling scheme. Default is 'strict' meaning that encoding errors raise
        a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
        'xmlcharrefreplace' as well as any other name registered with
        codecs.register_error that can handle UnicodeEncodeErrors.
        """
        return b""

The concrete effect of each scheme is as follows:

"迟迟不吃饭1232233.png".encode('ascii', 'strict')
# Traceback (most recent call last):
# File "<input>", line 1, in <module>
# UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)
"迟迟不吃饭1232233.png".encode('ascii', 'replace')
# b'?????1232233.png'
"迟迟不吃饭1232233.png".encode('ascii', 'xmlcharrefreplace')
# b'&#36831;&#36831;&#19981;&#21507;&#39277;1232233.png'
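
The session above does not show 'ignore', which is the scheme secure_filename actually uses. The following minimal sketch fills that gap and also shows why accented Latin names survive while Chinese names do not. The helper name to_ascii is made up for this illustration; it only mirrors the normalize-then-encode step quoted from the source above.

```python
from unicodedata import normalize

name = "迟迟不吃饭1232233.png"

# 'ignore' silently drops every character that ASCII cannot represent.
ignored = name.encode("ascii", "ignore")
print(ignored)  # b'1232233.png'


def to_ascii(text):
    """Hypothetical helper mirroring the normalize + encode step."""
    # NFKD splits 'Č' into 'C' plus a combining caron; the combining mark
    # is then dropped by 'ignore', leaving the plain ASCII letter.
    return normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


print(to_ascii("Marek Čech"))  # Marek Cech
print(to_ascii("Beniardá"))    # Beniarda
# Chinese characters have no ASCII decomposition, so they vanish entirely.
print(to_ascii(name))          # 1232233.png
```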

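Solution

Suppose an image is uploaded under the name /haha/pp/ll.jpg. Should we then look for ll.jpg, or for ll.jpg inside the /haha/pp directory? To avoid this ambiguity, the function below replaces path separators and other characters that are unsafe in filenames with underscores, while leaving Chinese characters untouched: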

import re

def validateTitle(title):
    rstr = r"[/\\:*?\"'<>|]"  # matches / \ : * ? " ' < > |
    new_title = re.sub(rstr, "_", title)  # replace each match with an underscore
    return new_title

print(validateTitle("./ddd/sss\\sd'sd\\\\迟迟不吃饭\"1232233.png"))
# ._ddd_sss_sd_sd__迟迟不吃饭_1232233.png
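
As an alternative to flattening the whole path into one name, a sanitizer can keep only the last path component and fall back to a random name when nothing usable remains. This is only a sketch under that assumption, not the approach taken in this post; the function name safe_unicode_filename and the uuid fallback are choices made for this example.

```python
import re
import uuid

# Same character set as validateTitle: / \ : * ? " ' < > |
_UNSAFE_CHARS = re.compile(r"[/\\:*?\"'<>|]")


def safe_unicode_filename(raw_name):
    """Hypothetical sanitizer that keeps non-ASCII characters."""
    # Treat both separator styles alike and keep only the last component,
    # so "/haha/pp/ll.jpg" is stored as "ll.jpg".
    base = raw_name.replace("\\", "/").rsplit("/", 1)[-1]
    # Replace the remaining unsafe characters, then trim leading and
    # trailing dots and spaces (this also turns ".." into an empty string).
    cleaned = _UNSAFE_CHARS.sub("_", base).strip(" .")
    if not cleaned:
        # Nothing usable is left, so generate a random name instead.
        cleaned = uuid.uuid4().hex
    return cleaned


print(safe_unicode_filename("/haha/pp/ll.jpg"))          # ll.jpg
print(safe_unicode_filename("迟迟不吃饭1232233.png"))     # 迟迟不吃饭1232233.png
print(safe_unicode_filename('迟迟"不吃饭.png'))           # 迟迟_不吃饭.png
```

Note that trimming leading dots also renames hidden-file style names such as .bashrc to bashrc, which is usually acceptable for uploads.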

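Some applications handle many images that must keep their original names, yet different users may upload files with identical names. In that case each image can be stored as a record with the following structure: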

type Img struct {
    name    string // name shown to the user
    secName string // name actually used for storage
    path    string // directory where the file is stored
}
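
In Python, the same record could be sketched with a dataclass. Everything here is an assumption made for illustration: the make_img helper, the use of uuid4 for the stored name, the extension check, and the /data/uploads directory are not from the original post.

```python
import os
import re
import uuid
from dataclasses import dataclass


@dataclass
class Img:
    name: str      # name shown to the user
    sec_name: str  # name actually used for storage
    path: str      # directory where the file is stored


def make_img(original_name, upload_dir):
    """Hypothetical helper: keep the display name, randomize the stored one."""
    ext = os.path.splitext(original_name)[1].lower()
    # Keep the extension only if it is short and purely alphanumeric.
    if not re.fullmatch(r"\.[a-z0-9]{1,10}", ext):
        ext = ""
    return Img(name=original_name,
               sec_name=uuid.uuid4().hex + ext,
               path=upload_dir)


a = make_img("迟迟不吃饭.png", "/data/uploads")
b = make_img("迟迟不吃饭.png", "/data/uploads")
print(a.name == b.name)          # True: both users see the original name
print(a.sec_name == b.sec_name)  # False: the stored files do not collide
```

Because the stored name never contains user-supplied characters apart from a checked extension, the display name can keep its Chinese characters without touching the file system.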

Other notes

normalize

Return the normal form 'form' for the Unicode string unistr.

Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

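In other words, form accepts four values. NFC and NFD apply canonical composition and canonical decomposition respectively, while NFKC and NFKD additionally apply compatibility mappings. secure_filename uses NFKD.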
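
A short sketch of how the four forms differ, using only unicodedata.normalize from the standard library. The sample characters are chosen for this example.

```python
from unicodedata import normalize

composed = "\u00e9"     # 'é' as a single code point
decomposed = "e\u0301"  # 'e' followed by a combining acute accent
ligature = "\ufb01"     # the 'fi' ligature, a compatibility character

# NFC composes, NFD decomposes; both keep compatibility characters.
print(normalize("NFC", decomposed) == composed)  # True
print(normalize("NFD", composed) == decomposed)  # True
print(normalize("NFC", ligature) == ligature)    # True

# The K forms also replace compatibility characters with plain equivalents.
print(normalize("NFKC", ligature))                         # fi
print(normalize("NFKD", ligature + composed) == "fie\u0301")  # True

# Chinese characters have no decomposition, so every form leaves them alone.
print(normalize("NFKD", "迟") == "迟")  # True
```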