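[Taste-Test Diary] Analyzing the AI model shipped with the AI Hub daily-life and colloquial Korean-English translation corpus

Eureka! What a windfall!

While hunting for data to build Korean-English and English-Korean models, I found out that AI Hub provides not only the data but also an actually trained, ready-to-use model.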

https://www.aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=71265

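They even provide the trained AI model itself?

They uploaded a whole Docker image.. well, there is presumably a ckpt file somewhere inside anyway.

Let's just follow the manual for now.

Download the entire Docker image from the link. It is a little over 8 GB.

Then load the whole tar file with docker load. You could call it crude, but it is also convenient in its own way.

Ugh… out of disk space… The machine fell over halfway through loading the Docker image. How do I clean up the half-loaded leftover data..

It doesn't look like it gets cleaned up automatically..

Hmm...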
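To avoid hitting a full disk halfway through, one option is to compare the tar size with the free space before running docker load. The following is only a sketch: the function name and the safety factor of 2 are my own assumptions, not anything from the AI Hub manual. For the leftovers themselves, `docker system df` shows what Docker is holding on to and `docker image prune` removes dangling images, which may help after a failed load.

```python
import os
import shutil


def has_room_for_load(tar_path, target_dir, safety_factor=2.0):
    """Return True if target_dir's filesystem looks big enough for the image.

    safety_factor is an assumed margin, not a Docker rule: the unpacked
    layers need roughly the tar's size, plus some slack for temporary files.
    """
    tar_size = os.path.getsize(tar_path)           # size of the image tar in bytes
    free = shutil.disk_usage(target_dir).free      # free bytes on that filesystem
    return free >= tar_size * safety_factor
```

For example, `has_room_for_load("ai-lang-docker.tar", "/var/lib/docker")` checks Docker's default data directory on Linux; the path would need adjusting if the data root has been moved.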

Loading the Docker image

Moved to another machine with more room and loaded it again.

Looks like it went fine.

whyun@k8s-worker-node14:~/workspace/aihub-transformer-translate$ sudo docker load -i ai-lang-docker.tar 
b14cb48b3aeb: Loading layer [==================================================>]  119.3MB/119.3MB
a1215953fc64: Loading layer [==================================================>]  17.18MB/17.18MB
b20560b6a21c: Loading layer [==================================================>]  17.87MB/17.87MB
330a6fb3364f: Loading layer [==================================================>]    150MB/150MB
126712f9d0fb: Loading layer [==================================================>]  520.9MB/520.9MB
9e0330b2c436: Loading layer [==================================================>]  18.51MB/18.51MB
133fd28b2544: Loading layer [==================================================>]   54.3MB/54.3MB
2e4f5df649f5: Loading layer [==================================================>]  4.608kB/4.608kB
7d0710ce7529: Loading layer [==================================================>]  8.877MB/8.877MB
afb49366d71a: Loading layer [==================================================>]   2.39GB/2.39GB
a4035109de11: Loading layer [==================================================>]  5.524GB/5.524GB
Loaded image: letr:0.2.0

Listing the images shows that it was registered properly.

whyun@k8s-worker-node14:~/workspace/aihub-transformer-translate$ sudo docker images
REPOSITORY                         TAG           IMAGE ID       CREATED         SIZE
letr                               0.2.0         96cc7eab4a05   7 months ago    8.78GB
quay.io/coreos/flannel             v0.14.0-rc1   0a1a2818ce59   18 months ago   67.9MB
k8s.gcr.io/kube-proxy              v1.21.0       38ddd85fe90e   18 months ago   122MB
nvcr.io/nvidia/k8s-device-plugin   v0.9.0        37b8c3899b15   19 months ago   191MB
k8s.gcr.io/pause                   3.4.1         0f8457a4c2ec   21 months ago   683kB

Hmm.. the container seems to come up fine too..

whyun@k8s-worker-node14:~/workspace/aihub-transformer-translate$ sudo docker run --name letr -d -p 5000:5000 letr:0.2.0
8ce1439ad449f2a37f5576ea76c4d299cf6b4cd9c91658cc8e9ae6098a8d72df
whyun@k8s-worker-node14:~/workspace/aihub-transformer-translate$ sudo docker ps
CONTAINER ID   IMAGE        COMMAND           CREATED         STATUS         PORTS                                       NAMES
8ce1439ad449   letr:0.2.0   "python app.py"   6 seconds ago   Up 6 seconds   0.0.0.0:5000->5000/tcp, :::5000->5000/tcp   letr

Let's run the sample code.

import requests
import json
import time
import argparse
from pprint import pprint

URL_LETR = 'http://0.0.0.0:5000'
URL_LETR_TRANSLATE = URL_LETR + '/translate'

def translate(sentences, source_language_code, target_language_code):
    print('Start LETR Translate ...')

    values = {'sentences': sentences, 'source_language_code': source_language_code,
              'target_language_code': target_language_code}
    result = requests.post(URL_LETR_TRANSLATE, data=json.loads(json.dumps(values)))
    output = json.loads(result.text)
    return output

if __name__ == '__main__':

    en_samples = [
        "Hi, Nice to meet you.",
        "A look into Coca Cola’s first marketing attempt on TikTok",
        "How long does a vacation have to last before it’s just…moving?",
        "Attached is the document requested for the work.",
        "Please let me know if you have questions.",
        "Sorry for the delay in getting back to you on this.",
    ]
    output = translate(en_samples, 'en', 'ko')
    pprint(output)

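It runs fine, but it's a bit slow? It took more than 10 seconds, so it looks like it used the CPU.

Even though the host machine has a mighty 3090…

Let's get inside the container for now.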

(pt) whyun@k8s-worker-node14:~/workspace/aihub-transformer-translate$ sudo docker exec -it 8ce1439ad449 /bin/bash
root@8ce1439ad449:/letr# 
root@8ce1439ad449:/letr# ls -la
total 80
drwxr-xr-x 1 root root  4096 Feb 24  2022 .
drwxr-xr-x 1 root root  4096 Oct 13 12:22 ..
-rw-rw-r-- 1 root root  6148 Feb 21  2022 .DS_Store
-rw-rw-r-- 1 root root     0 Feb 21  2022 .gitignore
-rw-rw-r-- 1 root root   386 Feb 23  2022 Dockerfile
-rw-rw-r-- 1 root root   924 Feb 16  2021 README.md
-rw-rw-r-- 1 root root 13237 Feb 16  2021 __init__.py
-rw-rw-r-- 1 root root  6456 Feb 22  2022 app.py
drwxrwxr-x 4 root root  4096 Feb 22  2022 gcon_translator
-rw-rw-r-- 1 root root  1427 Feb 23  2022 requirements.txt
drwxrwxr-x 3 root root  4096 Feb 21  2022 result
-rw-rw-r-- 1 root root    27 Feb 16  2021 run.sh
-rw-rw-r-- 1 root root  4094 Feb 22  2022 test.py
drwxrwxr-x 1 root root  4096 Feb 22  2022 transformer_translator

Hmm.. there are quite a few things in here that look necessary.

Let's pull all of it out to the host machine.

(pt) whyun@k8s-worker-node14:~/workspace/aihub-transformer-translate$ sudo docker cp 8ce1439ad449:/letr .
(pt) whyun@k8s-worker-node14:~/workspace/aihub-transformer-translate$ ls -la
total 8614580
drwxrwxr-x 3 whyun whyun       4096 Oct 13 22:37 .
drwxrwxr-x 3 whyun whyun       4096 Oct 13 22:19 ..
-rw-rw-r-- 1 whyun whyun 8821302784 Oct 13 21:17 ai-lang-docker.tar
drwxr-xr-x 5 root  root        4096 Feb 24  2022 letr
-rw-rw-r-- 1 whyun whyun       1027 Oct 13 21:23 test.py
(pt) whyun@k8s-worker-node14:~/workspace/aihub-transformer-translate$ cd letr/
(pt) whyun@k8s-worker-node14:~/workspace/aihub-transformer-translate/letr$ ls -al
total 72
drwxr-xr-x 5 root  root   4096 Feb 24  2022 .
drwxrwxr-x 3 whyun whyun  4096 Oct 13 22:37 ..
-rw-rw-r-- 1 root  root   6148 Feb 21  2022 .DS_Store
-rw-rw-r-- 1 root  root      0 Feb 21  2022 .gitignore
-rw-rw-r-- 1 root  root    386 Feb 23  2022 Dockerfile
-rw-rw-r-- 1 root  root    924 Feb 16  2021 README.md
-rw-rw-r-- 1 root  root  13237 Feb 16  2021 __init__.py
-rw-rw-r-- 1 root  root   6456 Feb 22  2022 app.py
drwxrwxr-x 4 root  root   4096 Feb 22  2022 gcon_translator
-rw-rw-r-- 1 root  root   1427 Feb 23  2022 requirements.txt
drwxrwxr-x 3 root  root   4096 Feb 21  2022 result
-rw-rw-r-- 1 root  root     27 Feb 16  2021 run.sh
-rw-rw-r-- 1 root  root   4094 Feb 22  2022 test.py
drwxrwxr-x 8 root  root   4096 Feb 22  2022 transformer_translator
(pt) whyun@k8s-worker-node14:~/workspace/aihub-transformer-translate/letr$ du -sk .

Let's have a look at the layout, shall we?

├── Dockerfile
├── README.md
├── __init__.py
├── app.py
├── gcon_translator
│   ├── __init__.py
│   ├── __pycache__
│   │   ├── __init__.cpython-37.pyc
│   │   ├── __init__.cpython-38.pyc
│   │   ├── constant.cpython-38.pyc
│   │   ├── elastic_search.cpython-37.pyc
│   │   ├── elastic_search.cpython-38.pyc
│   │   ├── google_api.cpython-38.pyc
│   │   ├── ngram.cpython-38.pyc
│   │   ├── td_searcher.cpython-37.pyc
│   │   └── td_searcher.cpython-38.pyc
│   ├── constant.py
│   ├── elastic_search.py
│   ├── ngram.py
│   ├── td_searcher.py
│   └── utils
│       ├── __init__.py
│       ├── __pycache__
│       │   ├── __init__.cpython-38.pyc
│       │   ├── parser.cpython-38.pyc
│       │   └── sentence_tokenizer.cpython-38.pyc
│       ├── parser.py
│       └── sentence_tokenizer.py
├── requirements.txt
├── result
│   └── en-ko_submit_0211
│       ├── info_log.txt
│       ├── model_deserialized.pt
│       └── params.json
├── run.sh
├── test.py
└── transformer_translator
    ├── Dataloader.py
    ├── __init__.py
    ├── __pycache__
    │   ├── Dataloader.cpython-38.pyc
    │   ├── __init__.cpython-37.pyc
    │   ├── __init__.cpython-38.pyc
    │   ├── predict.cpython-38.pyc
    │   ├── trainer.cpython-37.pyc
    │   └── trainer.cpython-38.pyc
    ├── data
    │   └── predict_text.txt
    ├── dataset
    │   └── ai_hub_total_nia_2020_total_nia_2021_food_total_0209_nia_2021_gu_0209_nia_2021_tech_0209_mecab_ko_moses
    │       ├── test.pickle
    │       ├── vocab_en.json
    │       └── vocab_ko.json
    ├── metric
    │   ├── __init__.py
    │   └── get_score.py
    ├── model
    │   ├── __pycache__
    │   │   ├── attention.cpython-37.pyc
    │   │   ├── attention.cpython-38.pyc
    │   │   ├── decoder.cpython-37.pyc
    │   │   ├── decoder.cpython-38.pyc
    │   │   ├── encoder.cpython-37.pyc
    │   │   ├── encoder.cpython-38.pyc
    │   │   ├── feedforward.cpython-37.pyc
    │   │   ├── feedforward.cpython-38.pyc
    │   │   ├── serialized_layers.cpython-37.pyc
    │   │   ├── serialized_layers.cpython-38.pyc
    │   │   ├── transformer.cpython-37.pyc
    │   │   └── transformer.cpython-38.pyc
    │   ├── attention.py
    │   ├── decoder.py
    │   ├── encoder.py
    │   ├── feedforward.py
    │   ├── optimizer.py
    │   └── transformer.py
    ├── predict.py
    ├── trainer.py
    └── utils
        ├── __init__.py
        ├── __pycache__
        │   ├── __init__.cpython-38.pyc
        │   └── utils.cpython-38.pyc
        ├── cache.py
        └── utils.py

16 directories, 69 files

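Installed the required packages (requirements.txt) so it can be run directly on the host.

The container probably ran on the CPU because CUDA support was not enabled when Docker was set up on the host.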
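The `docker run` command earlier in this post was started without a `--gpus` option, which is one common reason a container falls back to the CPU; with the NVIDIA Container Toolkit installed on the host, `docker run --gpus all ...` exposes the GPUs. A minimal sketch for checking what PyTorch actually sees, which could be run inside the container or on the host (the function name and the returned labels are my own):

```python
def cuda_status():
    """Report whether PyTorch can see a CUDA device in this environment."""
    try:
        import torch  # imported lazily so the check also works where torch is absent
    except ImportError:
        return "torch-missing"
    if not torch.cuda.is_available():
        return "cpu-only"
    # Name of the first visible GPU, e.g. an RTX 3090 on this host.
    return "cuda:" + torch.cuda.get_device_name(0)
```

If this returns "cpu-only" inside the container but a GPU name on the host, the container was most likely started without GPU access.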

Also, the spaCy English model has to be installed with this command

$ python -m spacy download en_core_web_sm

Running it again..

(pt) whyun@k8s-worker-node14:~/workspace/aihub-transformer-translate/letr$ python app.py 
Traceback (most recent call last):
  File "/home/whyun/miniconda3/envs/pt/lib/python3.8/site-packages/MeCab/__init__.py", line 133, in __init__
    super(Tagger, self).__init__(args)
RuntimeError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "app.py", line 4, in <module>
    from gcon_translator import Translator
  File "/home/whyun/workspace/aihub-transformer-translate/letr/gcon_translator/__init__.py", line 5, in <module>
    from .td_searcher import TDSearcher
  File "/home/whyun/workspace/aihub-transformer-translate/letr/gcon_translator/td_searcher.py", line 1, in <module>
    from .elastic_search import ElasticSearchManger, ELASTIC_CANDIDATE_MAX_SIZE
  File "/home/whyun/workspace/aihub-transformer-translate/letr/gcon_translator/elastic_search.py", line 18, in <module>
    from .ngram import split_into_ngram
  File "/home/whyun/workspace/aihub-transformer-translate/letr/gcon_translator/ngram.py", line 8, in <module>
    ja_mecab = MeCab.Tagger()  # load dictionary
  File "/home/whyun/miniconda3/envs/pt/lib/python3.8/site-packages/MeCab/__init__.py", line 135, in __init__
    raise RuntimeError(error_info(rawargs)) from ee
RuntimeError: 
----------------------------------------------------------

Failed initializing MeCab. Please see the README for possible solutions:

    https://github.com/SamuraiT/mecab-python3#common-issues

If you are still having trouble, please file an issue here, and include the
ERROR DETAILS below:

    https://github.com/SamuraiT/mecab-python3/issues

issueを英語で書く必要はありません。

------------------- ERROR DETAILS ------------------------
arguments: 
[ifs] no such file or directory: /usr/local/etc/mecabrc
----------------------------------------------------------

Looks like something Japanese-related is still left in there.. I don't need Japanese, though..?

Anyway.. they must have had a reason to use it.

In order to use MeCab you'll need to install a dictionary. unidic-lite is a good one to start with:

pip install unidic-lite

Still throws an error.

Traceback (most recent call last):
  File "app.py", line 4, in <module>
    from gcon_translator import Translator
  File "/home/whyun/workspace/aihub-transformer-translate/letr/gcon_translator/__init__.py", line 5, in <module>
    from .td_searcher import TDSearcher
  File "/home/whyun/workspace/aihub-transformer-translate/letr/gcon_translator/td_searcher.py", line 1, in <module>
    from .elastic_search import ElasticSearchManger, ELASTIC_CANDIDATE_MAX_SIZE
  File "/home/whyun/workspace/aihub-transformer-translate/letr/gcon_translator/elastic_search.py", line 41, in <module>
    mecab = Mecab()
  File "/home/whyun/miniconda3/envs/pt/lib/python3.8/site-packages/konlpy/tag/_mecab.py", line 80, in __init__
    raise Exception('The MeCab dictionary does not exist at "%s". Is the dictionary correctly installed?\nYou can also try entering the dictionary path when initializing the Mecab class: "Mecab(\'/some/dic/path\')"' % dicpath)
Exception: The MeCab dictionary does not exist at "/usr/local/lib/mecab/dic/mecab-ko-dic". Is the dictionary correctly installed?
You can also try entering the dictionary path when initializing the Mecab class: "Mecab('/some/dic/path')"

This happens because mecab-ko-dic is not installed.

Inside the Docker container, the following files are present.

root@ff69af741762:/letr# ls -l /usr/local/lib/mecab/dic/mecab-ko-dic/
total 109580
-rw-r--r-- 1 root root   262560 Feb 24  2022 char.bin
-rw-r--r-- 1 root root     1419 Feb 24  2022 dicrc
-rw-r--r-- 1 root root    76393 Feb 24  2022 left-id.def
-rw-r--r-- 1 root root 20585296 Feb 24  2022 matrix.bin
-rw-r--r-- 1 root root 10583428 Feb 24  2022 model.bin
-rw-r--r-- 1 root root     1550 Feb 24  2022 pos-id.def
-rw-r--r-- 1 root root     2479 Feb 24  2022 rewrite.def
-rw-r--r-- 1 root root   114511 Feb 24  2022 right-id.def
-rw-r--r-- 1 root root 80558854 Feb 24  2022 sys.dic
-rw-r--r-- 1 root root     4170 Feb 24  2022 unk.dic

Let's install it on the host as well.

First, the MeCab Korean morphological analyzer has to be installed.

https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/

See the post below for a detailed explanation.

[Python] Mecab 한글 형태소 분석기 바인딩 (Mecab Korean morphological analyzer binding), sens.tistory.com

For now,

download.. compile.. install

./configure;make;make install

On my system it was installed to /home/whyun/miniconda3/envs/pt/lib/mecab/dic/mecab-ko-dic.

So I opened gcon_translator/__init__.py and changed it as below

mecab = Mecab('/home/whyun/miniconda3/envs/pt/lib/mecab/dic/mecab-ko-dic')

I also changed the path in elastic_search.py in the same way.
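Hardcoding the path in several files is easy to get wrong (one spot can be missed). A minimal sketch of looking the dictionary up once instead; the helper name, the candidate list and the MECAB_KO_DIC environment variable are all hypothetical, not part of the LETR code or of konlpy:

```python
import os
import sys

# Candidate locations: the default path konlpy complained about in the error
# above, plus the active Python environment's prefix (where it landed here).
DEFAULT_CANDIDATES = [
    "/usr/local/lib/mecab/dic/mecab-ko-dic",
    os.path.join(sys.prefix, "lib", "mecab", "dic", "mecab-ko-dic"),
]


def find_mecab_ko_dic(candidates=None, env_var="MECAB_KO_DIC"):
    """Return the first directory that contains sys.dic, or None if none does."""
    paths = list(candidates if candidates is not None else DEFAULT_CANDIDATES)
    override = os.environ.get(env_var)  # hypothetical override variable
    if override:
        paths.insert(0, override)
    for path in paths:
        # sys.dic is one of the files listed in the container's dictionary folder.
        if os.path.isfile(os.path.join(path, "sys.dic")):
            return path
    return None
```

The result could then be passed to konlpy as `Mecab(path)`, the form suggested by konlpy's own error message, after handling the case where nothing was found.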

This time it complains that the pororo package is missing. Installed it with pip.

https://pypi.org/project/pororo/

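It downloads PyTorch 1.6.0 again? It already did that when installing from requirements.txt, and now it does it once more..

Hmm… I guess this is why..

Kakao Brain did release it publicly, but… seeing that it gets no version updates.. it doesn't seem to be supported anymore.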
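The repeated download of 1.6.0 suggests that an installed package pins an exact torch version, although this post does not show pororo's metadata. A small sketch for listing the exact-version requirements a package declares, using the standard library's importlib.metadata (available in Python 3.8, the version used here); the function name is my own:

```python
from importlib import metadata


def pinned_requirements(dist_name):
    """List the exact-version (==) requirements an installed distribution declares."""
    try:
        # requires() returns None when the distribution declares no requirements.
        requires = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return []  # the distribution is not installed in this environment
    return [req for req in requires if "==" in req]
```

Calling `pinned_requirements("pororo")` in the environment would show whether a torch pin is what makes pip pull 1.6.0 back in.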

Anyway, when I run it, the web server comes up fine. The catch is that the 3090 needs a newer PyTorch version.

Ran the example..

{'data': None, 'error': "translate error: \n**********************************************************************\n  Resource \x1b[93mpunkt\x1b[0m not found.\n  Please use the NLTK Downloader to obtain the resource:\n\n  \x1b[31m>>> import nltk\n  >>> nltk.download('punkt')\n  \x1b[0m\n  For more information see: https://www.nltk.org/data.html\n\n  Attempted to load \x1b[93mtokenizers/punkt/PY3/english.pickle\x1b[0m\n\n  Searched in:\n    - '/home/whyun/nltk_data'\n    - '/home/whyun/miniconda3/envs/pt/nltk_data'\n    - '/home/whyun/miniconda3/envs/pt/share/nltk_data'\n    - '/home/whyun/miniconda3/envs/pt/lib/nltk_data'\n    - '/usr/share/nltk_data'\n    - '/usr/local/share/nltk_data'\n    - '/usr/lib/nltk_data'\n    - '/usr/local/lib/nltk_data'\n    - ''\n**********************************************************************\n", 'reason': 'Please check your parameters.'}
[Korean Sentence Splitter]: 127.0.0.1 - - [15/Oct/2022 02:06:10] "POST /translate HTTP/1.1" 200 -

So the NLTK data is missing..

Solved it for now as shown below

(pt) whyun@k8s-worker-node14:~/workspace/aihub-transformer-translate/letr$ python -m nltk.downloader punkt
/home/whyun/miniconda3/envs/pt/lib/python3.8/runpy.py:127: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
[nltk_data] Downloading package punkt to /home/whyun/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.

Hmm.. what is it this time..

>>>>>>>>>>>>>>>>> config... 
:: Device / Hidden / Max_len: cuda:0, 1024, 128
:: ENC - 12, 8, 2048, 0.1
:: DEC - 12, 8, 2048, 0.1
:: SP_TOKENS - 81, VOCAB_SRC_SIZE - 40000, VOCAB_TRG_SIZE - 45000
/home/whyun/miniconda3/envs/pt/lib/python3.8/site-packages/torch/cuda/__init__.py:125: UserWarning: 
NVIDIA GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
If you want to use the NVIDIA GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
translate() sentences # =  6
before: Hi, Nice to meet you.
before: A look into Coca Cola’s first marketing attempt on TikTok
before: How long does a vacation have to last before it’s just…moving?
before: Attached is the document requested for the work.
before: Please let me know if you have questions.
before: Sorry for the delay in getting back to you on this.
source_language_code: en
target_language_code: ko

user_id: USER_ID
selected_dicts: None
---
TD Searcher init
Start TD search ... 6
not str None

Non-matched after TDSearch: 
   index                                                src tgt matchType
0      0                              Hi, Nice to meet you.          none
1      1  A look into Coca Cola’s first marketing attemp...          none
2      2  How long does a vacation have to last before i...          none
3      3   Attached is the document requested for the work.          none
4      4          Please let me know if you have questions.          none
5      5  Sorry for the delay in getting back to you on ...          none

>>>>>>>>>>>>>>>> Translating.... from [en] to [ko]
{'data': None, 'error': 'translate error: The MeCab dictionary does not exist at "/usr/local/lib/mecab/dic/mecab-ko-dic". Is the dictionary correctly installed?\nYou can also try entering the dictionary path when initializing the Mecab class: "Mecab(\'/some/dic/path\')"', 'reason': 'Please check your parameters.'}
[Korean Sentence Splitter]: 127.0.0.1 - - [15/Oct/2022 02:11:23] "POST /translate HTTP/1.1" 200 -

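It fails to find the MeCab dictionary again.

That is because I forgot to fix the path in one place..

The predict function also uses MeCab, and I had missed it.

Changed all of those as well.

Running it after that…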

>>>>>>>>>>>>>>>> Translating.... from [en] to [ko]
{'data': None, 'error': 'translate error: CUDA error: no kernel image is available for execution on the device', 'reason': 'Please check your parameters.'}
[Korean Sentence Splitter]: 127.0.0.1 - - [15/Oct/2022 02:15:09] "POST /translate HTTP/1.1" 200 -

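Hmm… this looks like a mismatch between the video card, the PyTorch version, and the Pororo version: as the warning above says, the installed PyTorch build only supports sm_37 through sm_75, while the RTX 3090 is sm_86.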
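The "no kernel image is available" error follows from that warning: the installed build simply contains no GPU code for this card. A rough sketch of the check, using the values from the warning; the function names are my own and the rule is a simplification that ignores PTX entries, not PyTorch's actual implementation. With a working install, the real inputs would come from `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`.

```python
def parse_sm(arch):
    """Turn a string like 'sm_86' into a (major, minor) tuple such as (8, 6)."""
    digits = arch.split("_", 1)[1]
    return int(digits[:-1]), int(digits[-1])


def has_native_kernels(device_capability, arch_list):
    """True if some sm_XY build in arch_list can run on the device.

    Simplified rule: same major version and build minor <= device minor.
    PTX ('compute_XY') entries, which could be JIT-compiled, are ignored.
    """
    dev_major, dev_minor = device_capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue
        major, minor = parse_sm(arch)
        if major == dev_major and minor <= dev_minor:
            return True
    return False
```

With the list from the warning, `has_native_kernels((8, 6), ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"])` is False, which matches the error seen here.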

Tried installing the latest PyTorch.

pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

Huh? torch 1.6 gets installed again?

Tried again as shown below.

conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 cudatoolkit=11.3 -c pytorch

Hmm.. it takes quite a while because everything is downloaded again..

Anyway, updated to a near-latest version.. and ran it again.

(pt) whyun@k8s-worker-node14:~/workspace/aihub-transformer-translate/letr$ python test.py 
Start LETR Translate ...
{'data': {'candidates': [],
          'charactersToCharge': 233,
          'source_language_code': 'en',
          'target_language_code': 'ko',
          'total': 6,
          'translated': [{'index': 0,
                          'matchType': 'none',
                          'source_language_code': 'en',
                          'src': 'Hi, Nice to meet you.',
                          'target_language_code': 'ko',
                          'tgt': '안녕하세요, 반갑습니다.'},
                         {'index': 1,
                          'matchType': 'none',
                          'source_language_code': 'en',
                          'src': 'A look into Coca Cola’s first marketing '
                                 'attempt on TikTok',
                          'target_language_code': 'ko',
                          'tgt': '코카콜라의 코카콜라 마케팅 시도를 먼저 살펴보세요.'},
                         {'index': 2,
                          'matchType': 'none',
                          'source_language_code': 'en',
                          'src': 'How long does a vacation have to last before '
                                 'it’s just…moving?',
                          'target_language_code': 'ko',
                          'tgt': '저번 휴가 때까지 얼마나 이동하면 되나요?'},
                         {'index': 3,
                          'matchType': 'none',
                          'source_language_code': 'en',
                          'src': 'Attached is the document requested for the '
                                 'work.',
                          'target_language_code': 'ko',
                          'tgt': '첨부된 서류는 업무에 대한 요청이 있습니다.'},
                         {'index': 4,
                          'matchType': 'none',
                          'source_language_code': 'en',
                          'src': 'Please let me know if you have questions.',
                          'target_language_code': 'ko',
                          'tgt': '궁금한 점이 있으면 알려주세요.'},
                         {'index': 5,
                          'matchType': 'none',
                          'source_language_code': 'en',
                          'src': 'Sorry for the delay in getting back to you '
                                 'on this.',
                          'target_language_code': 'ko',
                          'tgt': '늦어지는 것에 대해 다시 한번 사과드립니다.'}]}}

Works well.
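The responses seen so far come in two shapes: a `data.translated` list on success, and `data: None` plus an `error` string on failure, with HTTP 200 in both cases according to the server log. A minimal sketch of a client-side helper that handles both; the function name is my own and the field names are taken only from the outputs shown in this post:

```python
def extract_pairs(output):
    """Return [(src, tgt), ...] from a /translate response dict.

    Raises RuntimeError when the body carries an error, because the server
    answered the failed requests in this post with HTTP 200 as well.
    """
    if output.get("error"):
        raise RuntimeError(output["error"])
    data = output.get("data") or {}
    return [(item["src"], item["tgt"]) for item in data.get("translated", [])]
```

This could be applied to the return value of `translate(...)` from the sample script instead of pretty-printing the whole dictionary.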

Ran it with some other examples.

(pt) whyun@k8s-worker-node14:~/workspace/aihub-transformer-translate/letr$ !p
python test.py 
Start LETR Translate ...
{'data': {'candidates': [],
          'charactersToCharge': 118,
          'source_language_code': 'en',
          'target_language_code': 'ko',
          'total': 2,
          'translated': [{'index': 0,
                          'matchType': 'none',
                          'source_language_code': 'en',
                          'src': 'You see an undead being whose sole purpose '
                                 'is to protect its barrow mound.',
                          'target_language_code': 'ko',
                          'tgt': '무덤을 보호하는 것이 유일한 목적인 언데드가 보입니다.'},
                         {'index': 1,
                          'matchType': 'none',
                          'source_language_code': 'en',
                          'src': 'Flesh hangs from its wretched body, and it '
                                 'looks more dead than alive',
                          'target_language_code': 'ko',
                          'tgt': '가련한 몸에 살이 매달려 있고 살아있는 것보다 죽은 것처럼 보입니다.'}]}}

Works well.

Wow..

This actually works.

But.. some of the translations are pretty lame?

{'index': 2,
                          'matchType': 'none',
                          'source_language_code': 'en',
                          'src': 'Do you still have to take the medicine '
                                 'despite destroying the liver?',
                          'target_language_code': 'ko',
                          'tgt': '약을 먹어도 파괴해야 하나요?'}]}}

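It seems slower than expected.

The reason seems to be that the model is loaded every time. It feels like it is loaded on each incoming request rather than loaded once up front and kept waiting..

-> Yep.. that's it. It doesn't look deliberate… more like a one-off script was put straight onto the web with Flask without any optimization. In other words, maybe the Flask person and the AI person were different people..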
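One common way around per-request loading is to load once and reuse the object. The sketch below only illustrates the idea with a stand-in loader: `expensive_load` and `LOAD_CALLS` are hypothetical placeholders, and in the real app the project's own model-building function would take the loader's place.

```python
import functools

LOAD_CALLS = []  # only here to make the effect of caching visible


def expensive_load(path):
    """Stand-in for the real, slow model loading (hypothetical placeholder)."""
    LOAD_CALLS.append(path)
    return {"path": path}


@functools.lru_cache(maxsize=None)
def get_model(path):
    """Load the model for `path` on first use, then hand back the same object."""
    return expensive_load(path)
```

A request handler would then call `get_model(...)` each time; only the first call pays the loading cost, and later calls with the same path return the cached object.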

Also, the prediction code was written to use only the CPU, so I changed it slightly.

def build_model(path):
    params = Params(f"{path}/params.json")
    dataset_name = '_'.join(sorted(
        params.CORPUS_DATA)) + f"_{'_'.join(sorted([params.src_tokenizer_model_name, params.trg_tokenizer_model_name]))}" if type(
        params.CORPUS_DATA) == list else str(
        params.CORPUS_DATA)
    # vocab_only = True if args.mode!='dataset' else False
    dataset = build_dataset(params, dataset_name, select_dataset=[], vocab_only=True)

    src_vocab, trg_vocab = dataset['src_vocab'], dataset['trg_vocab']
    device = params.DEVICE if torch.cuda.is_available() else 'cpu'
    model = initialize_model(params, src_vocab, trg_vocab, device)

    print(f'loading to {device}')
    model.load_state_dict(torch.load(f'{path}/model_deserialized.pt', map_location=device)['model_state_dict'])
    #model.load_state_dict(torch.load(f'{path}/model_deserialized.pt', map_location='cpu')['model_state_dict'])
    model.eval()

    return model, src_vocab, trg_vocab, device

This is a run on the CPU..

>>>>>>>>>>>>>>>> Translating.... from [en] to [ko]

SRC 0 > You see an undead being whose sole purpose is to protect its barrow mound.
<unk> detected > <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>.
Sub-Translated by google:  무덤을 보호하는 것이 유일한 목적인 언데드가 보입니다.
Predict > 무덤을 보호하는 것이 유일한 목적인 언데드가 보입니다.
:: Elapsed time tracker :: [Total run-time of each sentence] - 1.08184 sec.
-------------------------------------------------------

SRC 1 > Flesh hangs from its wretched body, and it looks more dead than alive
<unk> detected > 살아있는 <unk>는 살아있는 것보다 죽은 것으로 보인다.
Sub-Translated by google:  가련한 몸에 살이 매달려 있고 살아있는 것보다 죽은 것처럼 보입니다.
Predict > 가련한 몸에 살이 매달려 있고 살아있는 것보다 죽은 것처럼 보입니다.
:: Elapsed time tracker :: [Total run-time of each sentence] - 0.67282 sec.
-------------------------------------------------------

SRC 2 > What are you doing?
Predict > 뭐 하는 거야?
:: Elapsed time tracker :: [Total run-time of each sentence] - 0.09980 sec.
-------------------------------------------------------
<<<<<<<<<< Repeated words detected...
Transformer_v1 Translated sentence # : 3, Elapsed time : 2.3650307655334473

This is a run on the GPU..

>>>>>>>>>>>>>>>> Translating.... from [en] to [ko]

SRC 0 > You see an undead being whose sole purpose is to protect its barrow mound.
<unk> detected > <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk>.
Sub-Translated by google:  무덤을 보호하는 것이 유일한 목적인 언데드가 보입니다.
Predict > 무덤을 보호하는 것이 유일한 목적인 언데드가 보입니다.
:: Elapsed time tracker :: [Total run-time of each sentence] - 1.10217 sec.
-------------------------------------------------------

SRC 1 > Flesh hangs from its wretched body, and it looks more dead than alive
<unk> detected > 살아있는 <unk>는 살아있는 것보다 죽은 것으로 보인다.
Sub-Translated by google:  가련한 몸에 살이 매달려 있고 살아있는 것보다 죽은 것처럼 보입니다.
Predict > 가련한 몸에 살이 매달려 있고 살아있는 것보다 죽은 것처럼 보입니다.
:: Elapsed time tracker :: [Total run-time of each sentence] - 0.65226 sec.
-------------------------------------------------------

SRC 2 > What are you doing?
Predict > 뭐 하는 거야?
:: Elapsed time tracker :: [Total run-time of each sentence] - 0.14716 sec.
-------------------------------------------------------
<<<<<<<<<< Repeated words detected...
Transformer_v1 Translated sentence # : 3, Elapsed time : 2.4258382320404053

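What.. the CPU is better?

Curiously, whether on CPU or GPU, running the same sentence a second time is faster. What's going on.

Why is that? I guess it means inference doesn't really need to be sent to the GPU.. That said, in both logs the two slow sentences triggered `<unk> detected` and were sub-translated by Google, so much of the elapsed time may not be model inference at all, and three sentences is a very small sample.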
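Since the second run of the same input is faster, a fairer comparison would discard a few warm-up runs and take the median of several repeats. A minimal sketch, with a helper name and default counts of my own choosing; when timing GPU work with PyTorch, the timed function should also call `torch.cuda.synchronize()` before returning, because CUDA kernels are launched asynchronously.

```python
import statistics
import time


def benchmark(fn, *args, warmup=2, repeats=5):
    """Return the median wall-clock seconds of fn(*args) after warm-up runs."""
    for _ in range(warmup):
        fn(*args)  # warm-up: caches, lazy initialisation, CUDA context setup
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

Running this once with the model on the CPU and once on the GPU, on sentences that do not fall back to the Google sub-translation, would give numbers that are easier to compare.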

Let's check whether this applies only to text… or to other domains as well.
Later..

Let's convert the pt inference engine to onnx later and measure the performance.
Later...

As an aside...

Digging through the traces left in the code.. this company turns up.

Is it a product brand of a company called Twigfarm?

https://www.letr.ai/

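Seems like a capable company.

A slight disappointment

Digging through the code… there is a function for predict, but none for train..

Hmm…. that's a bit of a shame.. I wanted to run a few more iterations and try training with other data added in...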
