Delete every non-UTF-8 symbol from a string

iphone6s 2023. 5. 27. 09:56

I have a large number of files and a parser. What I need to do is strip every non-UTF-8 symbol and put the data into MongoDB. Currently I have code like this:

with open(fname, "r") as fp:
    for line in fp:
        line = line.strip()
        line = line.decode('utf-8', 'ignore')
        line = line.encode('utf-8', 'ignore')

For some reason I still get an error:

bson.errors.InvalidStringData: strings in documents must be valid UTF-8: 
1/b62010montecassianomcir\xe2\x86\x90ta0\xe2\x86\x90008923304320733/290066010401040101506055soccorin

I don't understand why. Is there a simple way to do this?

UPD: It seems that Python and Mongo do not agree on the definition of a valid UTF-8 string.

Try the line of code below in place of the last two lines. I hope it helps.

line=line.decode('utf-8','ignore').encode("utf-8")

As mentioned in the comments on this thread, for Python 3 you can do the following:

line = bytes(line, 'utf-8').decode('utf-8', 'ignore')

The 'ignore' parameter prevents an error from being raised when some characters cannot be decoded.

If the line is already a bytes object (e.g. b'my string'), you only need to decode it with decode('utf-8', 'ignore').
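In Python 3, one way to get a bytes object for every line is to open the file in binary mode. The following is a minimal sketch, not a definitive implementation: the helper names `clean_utf8` and `iter_clean_lines` and the sample bytes are hypothetical and only illustrate the decode step described above.

```python
def clean_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, silently dropping invalid byte sequences."""
    return raw.decode("utf-8", errors="ignore")


def iter_clean_lines(path):
    """Yield cleaned, stripped lines from a file opened in binary mode."""
    with open(path, "rb") as fp:  # "rb" so each line is a bytes object
        for raw in fp:
            yield clean_utf8(raw.strip())


# Hypothetical sample: b"\xff" can never appear in UTF-8, and the trailing
# b"\xe2\x86" is a truncated three-byte sequence. Both are dropped.
sample = b"caf\xc3\xa9 \xff ok \xe2\x86"
cleaned = clean_utf8(sample)
print(repr(cleaned))
```

The resulting `str` can be encoded back to UTF-8 without errors, because every invalid sequence was discarded during decoding.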

Example of handling non-UTF-8 characters:

import string

test=u"\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n"

# Keep only the characters listed in string.printable (ASCII only).
print(''.join(x for x in test if x in string.printable))
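Note that string.printable contains only ASCII characters, so the filter above also removes valid non-ASCII text such as the pound sign (\xa3) and the non-breaking space (\xa0). If the goal is only to drop control characters such as \x03, a sketch along these lines may be closer to what is wanted. The function name `strip_control_chars` and the sample string are hypothetical.

```python
import unicodedata


def strip_control_chars(text: str, keep: str = "\n\t") -> str:
    """Drop Unicode control characters (category 'Cc') except those in `keep`."""
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


# Hypothetical sample: \x03 is a control character and is removed, while
# \xa3 (pound sign) and \xa0 (non-breaking space) are left in place.
sample = "price \xa325\x03 more\xa0filler\n"
result = strip_control_chars(sample)
print(repr(result))
```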

with open(fname, "r") as fp:
    for line in fp:
        line = line.strip()
        # Python 2: treat the raw bytes as Windows-1252 and re-encode them as UTF-8
        line = line.decode('cp1252').encode('utf-8')
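In Python 3 the same conversion can be handed to open() by passing the encoding directly. This is a sketch under the assumption that the input files really are Windows-1252. The function name `read_cp1252_lines` and the temporary demo file are hypothetical. A few byte values are unmapped in Python's cp1252 codec, so errors="replace" is used here to avoid a UnicodeDecodeError on them.

```python
import os
import tempfile


def read_cp1252_lines(path):
    """Yield stripped lines from a file assumed to be Windows-1252 encoded."""
    # errors="replace" substitutes U+FFFD for bytes the codec cannot map.
    with open(path, "r", encoding="cp1252", errors="replace") as fp:
        for line in fp:
            yield line.strip()


# Demo with a throwaway file: \xe9 and \xa3 are single-byte cp1252 characters.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"caf\xe9\n\xa325 more\n")

lines = list(read_cp1252_lines(tmp.name))
os.remove(tmp.name)
print(lines)
```

Each yielded line is a regular `str`, so it can be stored as is or encoded with `line.encode("utf-8")` if bytes are needed.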

Source URL: https://stackoverflow.com/questions/26541968/delete-every-non-utf-8-symbols-from-string
