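In this book, we will often work with text data represented as sequences of words, characters, or word pieces. To get going, we need some basic tools for converting raw text into sequences of the appropriate form. A typical preprocessing pipeline executes the following steps:
- Load text as strings into memory.
- Split the strings into tokens (e.g., words or characters).
- Build a vocabulary dictionary that associates each vocabulary element with a numerical index.
- Convert the text into sequences of numerical indices.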
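The four steps can be sketched end to end on a short in-memory string. This is only an illustration using plain Python; the sample text and the names `raw`, `tokens`, `token_to_idx`, and `indices` are assumptions made for this example, not part of the book's code.

```python
# Step 1: "load" text as a string (a short literal stands in for a file here).
raw = "the time machine by h g wells"

# Step 2: split the string into tokens (word-level tokens in this sketch).
tokens = raw.split()

# Step 3: build a vocabulary mapping each distinct token to an integer index.
token_to_idx = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

# Step 4: convert the text into a sequence of numerical indices.
indices = [token_to_idx[tok] for tok in tokens]

print(tokens)
print(indices)
```

Sorting the distinct tokens before numbering them is one arbitrary but reproducible choice; any one-to-one assignment of indices would serve the same purpose.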
import collections
import random
import re
import torch
from d2l import torch as d2l
9.2.1. Reading the Dataset
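Here, we will work with H. G. Wells' The Time Machine, a book of just over 30,000 words. While real applications typically involve significantly larger datasets, this is sufficient to demonstrate the preprocessing pipeline. The following _download method reads the raw text into a string.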
class TimeMachine(d2l.DataModule): #@save
    """The Time Machine dataset."""
    def _download(self):
        fname = d2l.download(d2l.DATA_URL + 'timemachine.txt', self.root,
                             '090b5e7e70c295757f55df93cb0a180b9691891a')
        with open(fname) as f:
            return f.read()

data = TimeMachine()
raw_text = data._download()
raw_text[:60]
'The Time Machine, by H. G. Wells [1898]\n\n\n\n\nI\n\n\nThe Time Tra'
For simplicity, we ignore punctuation and capitalization when preprocessing the raw text.
@d2l.add_to_class(TimeMachine) #@save
def _preprocess(self, text):
    return re.sub('[^A-Za-z]+', ' ', text).lower()

text = data._preprocess(raw_text)
text[:60]
'the time machine by h g wells i the time traveller for so it'
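To see what this regular expression does in isolation, here is a small standalone sketch. The free function `preprocess` and the sample string are assumptions for illustration; only the pattern and the call to `lower()` come from the method above.

```python
import re

def preprocess(text):
    # Replace every run of non-letter characters with a single space,
    # then lowercase the result (same pattern as in _preprocess).
    return re.sub('[^A-Za-z]+', ' ', text).lower()

sample = "The Time Machine, by H. G. Wells [1898]\n\nI"
cleaned = preprocess(sample)
print(repr(cleaned))

# Apostrophes are non-letters too, so contractions are split apart.
print(repr(preprocess("don't")))
```

Digits, punctuation, and newlines all collapse into single spaces, which is why the year in the header disappears and why a word such as "don't" becomes the two pieces "don" and "t".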
9.2.2. Tokenization
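Tokens are the atomic (indivisible) units of text. Each time step corresponds to 1 token, but what precisely constitutes a token is a design choice. For example, we could represent the sentence "Baby needs a new pair of shoes" as a sequence of 7 words, where the set of all words comprises a large vocabulary (typically tens or hundreds of thousands of words). Or we could represent the same sentence as a much longer sequence of 30 characters, using a much smaller vocabulary (8-bit extended ASCII has only 256 distinct characters, and standard ASCII only 128). Below, we tokenize our preprocessed text into a sequence of characters.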
't,h,e, ,t,i,m,e, ,m,a,c,h,i,n,e, ,b,y, ,h, ,g, ,w,e,l,l,s, '
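The output above is consistent with splitting the text into a list of single characters and joining the first 30 with commas. The sketch below assumes exactly that, and also checks the word and character counts quoted for the example sentence; the variable names are illustrative, not the book's.

```python
sentence = "Baby needs a new pair of shoes"

word_tokens = sentence.split()   # word-level tokens
char_tokens = list(sentence)     # character-level tokens (spaces included)
print(len(word_tokens), len(char_tokens))

# Character-level tokenization of the start of the preprocessed text.
text = 'the time machine by h g wells '
tokens = list(text)
print(','.join(tokens[:30]))
```

Character-level tokens give longer sequences over a tiny vocabulary, while word-level tokens give shorter sequences over a much larger one.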
9.2.3. Vocabulary
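These tokens are still strings. However, the inputs to our models must ultimately consist of numerical values. Next, we introduce a class for constructing vocabularies, i.e., objects that associate each distinct token value with a unique index. First, we determine the set of unique tokens in our training corpus. We then assign a numerical index to each unique token. Rare vocabulary elements are often dropped for convenience. Whenever we encounter a token at training or test time that had not been previously seen or was dropped from the vocabulary, we represent it by a special "<unk>" token, signifying that this is an unknown value.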
class Vocab: #@save
    """Vocabulary for text."""
    def __init__(self, tokens=[], min_freq=0, reserved_tokens=[]):
        # Flatten a 2D list if needed
        if tokens and isinstance(tokens[0], list):
            tokens = [token for line in tokens for token in line]
        # Count token frequencies
        counter = collections.Counter(tokens)
        self.token_freqs = sorted(counter.items(), key=lambda x: x[1],
                                  reverse=True)
        # The list of unique tokens
        self.idx_to_token = list(sorted(set(['<unk>'] + reserved_tokens + [
            token for token, freq in self.token_freqs if freq >= min_freq])))
        self.token_to_idx = {token: idx
                             for idx, token in enumerate(self.idx_to_token)}

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens,
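
A self-contained sketch of the same idea may help make the token-to-index mapping concrete. The class name `SimpleVocab`, the `unk` parameter, and the `unk_idx` attribute are hypothetical names chosen for this example and are not the book's API. The sketch only assumes what the text states: infrequent tokens are dropped, and unseen or dropped tokens map to the index of "<unk>".

```python
import collections

class SimpleVocab:
    """Hypothetical minimal vocabulary; not the book's full Vocab class."""
    def __init__(self, tokens, min_freq=0, unk='<unk>'):
        counter = collections.Counter(tokens)
        # Keep tokens occurring at least min_freq times, plus the unknown token.
        kept = {tok for tok, freq in counter.items() if freq >= min_freq}
        self.idx_to_token = sorted(kept | {unk})
        self.token_to_idx = {tok: i for i, tok in enumerate(self.idx_to_token)}
        self.unk_idx = self.token_to_idx[unk]

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        # A single token maps to one index; a list or tuple maps elementwise.
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk_idx)
        return [self[tok] for tok in tokens]

vocab = SimpleVocab(list('the time machine'), min_freq=2)
print(vocab.idx_to_token)
print(vocab[['t', 'h', 'z']])  # 'z' was never seen, so it maps to unk_idx
print(vocab['a'])              # 'a' occurs once and was dropped by min_freq=2
```

With `min_freq=2`, the characters that appear only once ('a', 'c', 'n') are dropped, so looking them up returns the same index as a character that never appeared at all.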