This vignette walks you through importing a variety of different text files into R using the readtext package. Currently, readtext supports plain text files (.txt), data in some form of JavaScript Object Notation (.json), comma- or tab-separated values (.csv, .tab, .tsv), XML documents (.xml), as well as PDF and Microsoft Word formatted files (.pdf, .doc, .docx).
readtext also handles multiple files and file types using, for instance, a “glob” expression, files from a URL, or an archive file (.zip, .tar, .tar.gz, .tar.bz). Usually, you do not have to specify the format of the files explicitly: readtext takes this information from the file ending.
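For readers unfamiliar with glob expressions, the following base R sketch shows what a pattern such as "*.txt" expands to and how a file ending can be read off a file name. It is only an illustration of the two ideas and makes no claim about how readtext implements them internally; the folder and file names are hypothetical.

```r
# Create a throwaway folder with three empty files (names are hypothetical)
demo_dir <- file.path(tempdir(), "glob_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("a.txt", "b.txt", "c.csv"))))

# A glob such as "*.txt" expands to every matching path
txt_files <- Sys.glob(file.path(demo_dir, "*.txt"))
basename(txt_files)

# The file ending is what identifies the format
tools::file_ext(list.files(demo_dir))
```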
The readtext package comes with a data directory
called extdata that contains examples of all files listed
above. In the vignette, we use this data directory.
The extdata directory contains several subfolders that
include different text files. In the following examples, we load one or
more files stored in each of these folders. The paste0
command is used to concatenate the extdata folder from the
readtext package with the subfolders. When reading in
your own text files, you will need to set your own data directory
(see ?setwd).
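The examples that follow refer to an object called DATA_DIR. A minimal setup sketch, assuming readtext is installed and that DATA_DIR simply points at the package's extdata directory, could look like this (udhr_glob is a name made up for this illustration):

```r
# Load the package (assumes readtext is installed)
library(readtext)

# Locate the extdata directory shipped with the installed package
DATA_DIR <- system.file("extdata", package = "readtext")

# Build a path to one of the subfolders with paste0, as the examples do
udhr_glob <- paste0(DATA_DIR, "/txt/UDHR/*")
udhr_glob
```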
The folder “txt” contains a subfolder named UDHR with .txt files of the Universal Declaration of Human Rights in 13 languages.
# Read in all files from a folder
readtext(paste0(DATA_DIR, "/txt/UDHR/*"))
## readtext object consisting of 13 documents and 0 docvars.
## $text
##  [1] "# A data frame: 13 × 2"                                  
##  [2] "  doc_id            text                         "       
##  [3] "  <chr>             <chr>                        "       
##  [4] "1 UDHR_chinese.txt  \"\\\"世界人权宣言\\n联合国\\\"...\""
##  [5] "2 UDHR_czech.txt    \"\\\"VŠEOBECNÁ \\\"...\"          " 
##  [6] "3 UDHR_danish.txt   \"\\\"Den 10. de\\\"...\"          " 
##  [7] "4 UDHR_english.txt  \"\\\"Universal \\\"...\"          " 
##  [8] "5 UDHR_french.txt   \"\\\"Déclaratio\\\"...\"          " 
##  [9] "6 UDHR_georgian.txt \"\\\"FLFVBFYBC \\\"...\"          " 
## [10] "# ℹ 7 more rows"                                         
## 
## $summary
## $summary[[1]]
## NULL
## 
## 
## attr(,"class")
## [1] "trunc_mat"
We can specify document-level metadata (docvars) based
on the file names or on a separate data.frame. Below we take the docvars
from the filenames (docvarsfrom = "filenames") and set the
names for each variable
(docvarnames = c("unit", "context", "year", "language", "party")).
The argument dvsep = "_" sets the separator (a regular
expression character string) that delimits the docvar elements
in the filenames.
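To see what the separator does, the following base R sketch splits one of the manifesto file names by hand. It only illustrates the idea and is not readtext's internal code; the object names fname, parts and docvars are made up for this example. readtext additionally converts types (year appears as an integer in its output), which this sketch does not do.

```r
# Illustrative only: mimic how a file name maps onto docvars
fname <- "EU_euro_2004_de_PSE.txt"

# Drop the extension, then split on the separator given in dvsep
parts <- strsplit(tools::file_path_sans_ext(fname), "_")[[1]]

# Attach the variable names given in docvarnames
docvars <- setNames(as.list(parts),
                    c("unit", "context", "year", "language", "party"))
str(docvars)
```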
# Manifestos with docvars from filenames
readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
         docvarsfrom = "filenames", 
         docvarnames = c("unit", "context", "year", "language", "party"),
         dvsep = "_", 
         encoding = "ISO-8859-1")
## readtext object consisting of 17 documents and 5 docvars.
## $text
##  [1] "# A data frame: 17 × 7"                                                                
##  [2] "  doc_id                  text                unit  context  year language party"      
##  [3] "  <chr>                   <chr>               <chr> <chr>   <int> <chr>    <chr>"      
##  [4] "1 EU_euro_2004_de_PSE.txt \"\\\"PES · PSE \\\"...\" EU    euro     2004 de       PSE  "
##  [5] "2 EU_euro_2004_de_V.txt   \"\\\"Gemeinsame\\\"...\" EU    euro     2004 de       V    "
##  [6] "3 EU_euro_2004_en_PSE.txt \"\\\"PES · PSE \\\"...\" EU    euro     2004 en       PSE  "
##  [7] "4 EU_euro_2004_en_V.txt   \"\\\"Manifesto\\n\\\"..… EU    euro     2004 en       V    "
##  [8] "5 EU_euro_2004_es_PSE.txt \"\\\"PES · PSE \\\"...\" EU    euro     2004 es       PSE  "
##  [9] "6 EU_euro_2004_es_V.txt   \"\\\"Manifesto\\n\\\"..… EU    euro     2004 es       V    "
## [10] "# ℹ 11 more rows"                                                                      
## 
## $summary
## $summary[[1]]
## NULL
## 
## 
## attr(,"class")
## [1] "trunc_mat"
readtext can also recurse through subdirectories. In
our example, the folder txt/movie_reviews contains two
subfolders (called neg and pos). We can load
all texts included in both folders.
# Recurse through subdirectories
readtext(paste0(DATA_DIR, "/txt/movie_reviews/*"))
## readtext object consisting of 10 documents and 0 docvars.
## $text
##  [1] "# A data frame: 10 × 2"                            
##  [2] "  doc_id              text                "        
##  [3] "  <chr>               <chr>               "        
##  [4] "1 neg_cv000_29416.txt \"\\\"plot : two\\\"...\" "  
##  [5] "2 neg_cv001_19502.txt \"\\\"the happy \\\"...\" "  
##  [6] "3 neg_cv002_17424.txt \"\\\"it is movi\\\"...\" "  
##  [7] "4 neg_cv003_12683.txt \"\\\" \\\" quest f\\\"...\""
##  [8] "5 neg_cv004_12641.txt \"\\\"synopsis :\\\"...\" "  
##  [9] "6 pos_cv000_29590.txt \"\\\"films adap\\\"...\" "  
## [10] "# ℹ 4 more rows"                                   
## 
## $summary
## $summary[[1]]
## NULL
## 
## 
## attr(,"class")
## [1] "trunc_mat"
readtext can read comma-separated values (.csv files) that contain textual
data. We specify the texts column in our .csv file as
the text_field. This is the column that contains the actual
text. The other columns of the original .csv file (Year,
President, FirstName) are by default treated
as document-level variables.
# Read in comma-separated values
readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
## readtext object consisting of 5 documents and 3 docvars.
## $text
## [1] "# A data frame: 5 × 5"                                                   
## [2] "  doc_id            text                 Year President  FirstName"      
## [3] "  <chr>             <chr>               <int> <chr>      <chr>    "      
## [4] "1 inaugCorpus.csv.1 \"\\\"Fellow-Cit\\\"...\"  1789 Washington George   "
## [5] "2 inaugCorpus.csv.2 \"\\\"Fellow cit\\\"...\"  1793 Washington George   "
## [6] "3 inaugCorpus.csv.3 \"\\\"When it wa\\\"...\"  1797 Adams      John     "
## [7] "4 inaugCorpus.csv.4 \"\\\"Friends an\\\"...\"  1801 Jefferson  Thomas   "
## [8] "5 inaugCorpus.csv.5 \"\\\"Proceeding\\\"...\"  1805 Jefferson  Thomas   "
## 
## $summary
## $summary[[1]]
## NULL
## 
## 
## attr(,"class")
## [1] "trunc_mat"
The same procedure applies to tab-separated values.
# Read in tab-separated values
readtext(paste0(DATA_DIR, "/tsv/dailsample.tsv"), text_field = "speech")
## readtext object consisting of 33 documents and 9 docvars.
## $text
##  [1] "# A data frame: 33 × 11"                                                            
##  [2] "  doc_id         text  speechID memberID partyID constID title date  member_name"   
##  [3] "  <chr>          <chr>    <int>    <int>   <int>   <int> <chr> <chr> <chr>      "   
##  [4] "1 dailsample.ts… \"\\\"M…        1      977      22     158 1. C… 1919… Count Geor…"
##  [5] "2 dailsample.ts… \"\\\"I…        2     1603      22     103 1. C… 1919… Mr. Pádrai…"
##  [6] "3 dailsample.ts… \"\\\"'…        3      116      22     178 1. C… 1919… Mr. Cathal…"
##  [7] "4 dailsample.ts… \"\\\"T…        4      116      22     178 2. C… 1919… Mr. Cathal…"
##  [8] "5 dailsample.ts… \"\\\"L…        5      116      22     178 3. A… 1919… Mr. Cathal…"
##  [9] "6 dailsample.ts… \"\\\"-…        6      116      22     178 3. A… 1919… Mr. Cathal…"
## [10] "# ℹ 27 more rows"                                                                   
## [11] "# ℹ 2 more variables: party_name <chr>, const_name <chr>"                           
## 
## $summary
## $summary[[1]]
## NULL
## 
## 
## attr(,"class")
## [1] "trunc_mat"
You can also read .json data. Again, you need to specify the
text_field.
# Read in JSON data
readtext(paste0(DATA_DIR, "/json/inaugural_sample.json"), text_field = "texts")
## readtext object consisting of 3 documents and 3 docvars.
## $text
## [1] "# A data frame: 3 × 5"                                                         
## [2] "  doc_id                  text                 Year President  FirstName"      
## [3] "  <chr>                   <chr>               <int> <chr>      <chr>    "      
## [4] "1 inaugural_sample.json.1 \"\\\"Fellow-Cit\\\"...\"  1789 Washington George   "
## [5] "2 inaugural_sample.json.2 \"\\\"Fellow cit\\\"...\"  1793 Washington George   "
## [6] "3 inaugural_sample.json.3 \"\\\"When it wa\\\"...\"  1797 Adams      John     "
## 
## $summary
## $summary[[1]]
## NULL
## 
## 
## attr(,"class")
## [1] "trunc_mat"
readtext can also read in and convert .pdf files.
In the example below we load all .pdf files stored in the
UDHR folder and take the docvars
from the filenames. We call the document-level variables
document and language, and specify the
delimiter (dvsep).
# Read in Universal Declaration of Human Rights pdf files
(rt_pdf <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"), 
                    docvarsfrom = "filenames", 
                    docvarnames = c("document", "language"),
                    dvsep = "_"))
## readtext object consisting of 11 documents and 2 docvars.
## $text
##  [1] "# A data frame: 11 × 4"                                                    
##  [2] "  doc_id           text                          document language"        
##  [3] "  <chr>            <chr>                         <chr>    <chr>   "        
##  [4] "1 UDHR_chinese.pdf \"\\\"世界人权宣言\\n\\n联合\\\"...\" UDHR     chinese "
##  [5] "2 UDHR_czech.pdf   \"\\\"VŠEOBECNÁ \\\"...\"           UDHR     czech   "  
##  [6] "3 UDHR_danish.pdf  \"\\\"Den 10. de\\\"...\"           UDHR     danish  "  
##  [7] "4 UDHR_english.pdf \"\\\"Universal \\\"...\"           UDHR     english "  
##  [8] "5 UDHR_french.pdf  \"\\\"Déclaratio\\\"...\"           UDHR     french  "  
##  [9] "6 UDHR_greek.pdf   \"\\\"ΟΙΚΟΥΜΕΝΙΚ\\\"...\"           UDHR     greek   "  
## [10] "# ℹ 5 more rows"                                                           
## 
## $summary
## $summary[[1]]
## NULL
## 
## 
## attr(,"class")
## [1] "trunc_mat"
Microsoft Word formatted files are converted through the package
antiword for older .doc files, and using
XML for newer .docx files.
# Read in Word data (.docx)
readtext(paste0(DATA_DIR, "/word/*.docx"))
## readtext object consisting of 2 documents and 0 docvars.
## $text
## [1] "# A data frame: 2 × 2"                                  
## [2] "  doc_id                      text               "      
## [3] "  <chr>                       <chr>              "      
## [4] "1 UK_2015_EccentricParty.docx \"\\\"The Eccent\\\"...\""
## [5] "2 UK_2015_LoonyParty.docx     \"\\\"The Offici\\\"...\""
## 
## $summary
## $summary[[1]]
## NULL
## 
## 
## attr(,"class")
## [1] "trunc_mat"
You can also read in text directly from a URL.
readtext was originally developed in early versions
of the quanteda
package for the quantitative analysis of textual data. It was spawned
from the textfile() function from that package, and now
lives exclusively in readtext. Because
quanteda’s corpus constructor recognizes the data.frame
format returned by readtext(), it can construct a corpus
directly from a readtext object, preserving all docvars and
other meta-data.
You can easily construct a corpus from a readtext object.
if (require("quanteda")) {
    # read in comma-separated values with readtext
    rt_csv <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
    # create quanteda corpus
    corpus_csv <- corpus(rt_csv)
    summary(corpus_csv, 5)
}
## Loading required package: quanteda
## Package version: 4.3.1
## Unicode version: 14.0
## ICU version: 71.1
## Parallel computing: 10 of 10 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:readtext':
## 
##     texts
## Corpus consisting of 5 documents, showing 5 documents:
## 
##               Text Types Tokens Sentences Year  President FirstName
##  inaugCorpus.csv.1   625   1540        23 1789 Washington    George
##  inaugCorpus.csv.2    96    147         4 1793 Washington    George
##  inaugCorpus.csv.3   826   2578        37 1797      Adams      John
##  inaugCorpus.csv.4   717   1927        41 1801  Jefferson    Thomas
##  inaugCorpus.csv.5   804   2381        45 1805  Jefferson    Thomas
When a document contains page numbers, they are imported as well. If you want to remove them, you can use a regular expression. We strongly recommend using the stringi package. For the most common regular expressions you can look at this cheatsheet.
You first need to check in the original file in which format the page
numbers occur (e.g., “1”, “-1-”, “page 1” etc.). We can make use of the
fact that page numbers are almost always preceded and followed by a
linebreak (\n). After loading the text with
readtext, you can replace the page numbers.
In the first example, the page numbers have the format “page X”.
# Make some text with page numbers
sample_text_a <- "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, 
page 1 
with the newspaper from a boy named quick Seamus, in his mouth.
page 2
The quicker brown fox jumped over 2 lazy dogs."
sample_text_a
## [1] "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, \npage 1 \nwith the newspaper from a boy named quick Seamus, in his mouth.\npage 2\nThe quicker brown fox jumped over 2 lazy dogs."
# Load stringi for the stri_* functions used below
library(stringi)
# Remove "page" and respective digit
sample_text_a2 <- unlist(stri_split_fixed(sample_text_a, '\n'), use.names = FALSE)
sample_text_a2 <- stri_replace_all_regex(sample_text_a2, "page \\d*", "")
sample_text_a2 <- stri_trim_both(sample_text_a2)
sample_text_a2 <- sample_text_a2[sample_text_a2 != '']
stri_paste(sample_text_a2, collapse = '\n')
## [1] "The quick brown fox named Seamus jumps over the lazy dog also named Seamus,\nwith the newspaper from a boy named quick Seamus, in his mouth.\nThe quicker brown fox jumped over 2 lazy dogs."
In the second example we remove page numbers which have the format “- X -”.
sample_text_b <- "The quick brown fox named Seamus 
- 1 - 
jumps over the lazy dog also named Seamus, with 
- 2 - 
the newspaper from a boy named quick Seamus, in his mouth. 
- 33 - 
The quicker brown fox jumped over 2 lazy dogs."
sample_text_b
## [1] "The quick brown fox named Seamus \n- 1 - \njumps over the lazy dog also named Seamus, with \n- 2 - \nthe newspaper from a boy named quick Seamus, in his mouth. \n- 33 - \nThe quicker brown fox jumped over 2 lazy dogs."
sample_text_b2 <- unlist(stri_split_fixed(sample_text_b, '\n'), use.names = FALSE)
sample_text_b2 <- stri_replace_all_regex(sample_text_b2, "[-] \\d* [-]", "")
sample_text_b2 <- stri_trim_both(sample_text_b2)
sample_text_b2 <- sample_text_b2[sample_text_b2 != '']
stri_paste(sample_text_b2, collapse = '\n')
## [1] "The quick brown fox named Seamus\njumps over the lazy dog also named Seamus, with\nthe newspaper from a boy named quick Seamus, in his mouth.\nThe quicker brown fox jumped over 2 lazy dogs."
Such stringi functions can also be applied to readtext objects.
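As a minimal sketch of that idea: a readtext object is a data.frame with a text column, so a replacement can be applied to the whole column at once. The example below uses a small hand-made data.frame as a stand-in for a readtext object. The documents, the name rt_demo and the single-pass regular expression are assumptions for illustration, not part of the package.

```r
library(stringi)

# Stand-in for a readtext object: a data.frame with doc_id and text columns
rt_demo <- data.frame(
  doc_id = c("a.txt", "b.txt"),
  text = c("First page.\npage 1\nSecond page.", "Intro\npage 12\nEnd"),
  stringsAsFactors = FALSE
)

# (?m) lets ^ and $ match at line breaks, so only lines consisting
# solely of "page X" are removed, together with their trailing newline
rt_demo$text <- stri_replace_all_regex(
  rt_demo$text, "(?m)^[ \\t]*page \\d+[ \\t]*$\\n?", ""
)
rt_demo$text
```

For an object returned by readtext(), the same assignment to its text column should work in the same way, since the docvars live in the other columns and are left untouched.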
Sometimes files of the same type have different encodings. If the encoding of a file is included in the file name, we can extract this information and import the texts correctly.
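To illustrate why the encoding matters, here is a small base R sketch, independent of readtext, that builds the bytes of a Latin-1 string by hand and converts them with iconv(). The byte values and variable names are chosen for this example only.

```r
# "Décla" encoded as ISO-8859-1: the é is the single byte 0xE9
latin1_bytes <- as.raw(c(0x44, 0xE9, 0x63, 0x6C, 0x61))
raw_string <- rawToChar(latin1_bytes)

# Interpreted as UTF-8, the lone 0xE9 byte is invalid
validUTF8(raw_string)

# Declaring the source encoding yields a proper UTF-8 string
fixed <- iconv(raw_string, from = "ISO-8859-1", to = "UTF-8")
validUTF8(fixed)
```

The same principle is at work when an encoding is passed to readtext: the bytes on disk do not change, only the rule used to interpret them.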
# create a temporary directory to extract the .zip file
FILEDIR <- tempdir()
# unzip file
unzip(system.file("extdata", "data_files_encodedtexts.zip", package = "readtext"), exdir = FILEDIR)
Here, we will get the encoding from the filenames themselves.
# get encoding from filename
filenames <- list.files(FILEDIR, "^(Indian|UDHR_).*\\.txt$")
head(filenames)
## [1] "IndianTreaty_English_UTF-16LE.txt"  "IndianTreaty_English_UTF-8-BOM.txt"
## [3] "UDHR_Arabic_ISO-8859-6.txt"         "UDHR_Arabic_UTF-8.txt"             
## [5] "UDHR_Arabic_WINDOWS-1256.txt"       "UDHR_Chinese_GB2312.txt"
# Strip the extension
filenames <- gsub("\\.txt$", "", filenames)
parts <- strsplit(filenames, "_")
fileencodings <- sapply(parts, "[", 3)
head(fileencodings)
## [1] "UTF-16LE"     "UTF-8-BOM"    "ISO-8859-6"   "UTF-8"        "WINDOWS-1256"
## [6] "GB2312"
# Check whether certain file encodings are not supported
notAvailableIndex <- which(!(fileencodings %in% iconvlist()))
fileencodings[notAvailableIndex]
## [1] "UTF-8-BOM"
If we read the text files without specifying the encoding, we get
erroneously formatted text. To avoid this, we set the
encoding using the character vector
fileencodings created above.
We can also add docvars based on the filenames.
txts <- readtext(paste0(DATA_DIR, "/data_files_encodedtexts.zip"), 
                 encoding = fileencodings,
                 docvarsfrom = "filenames", 
                 docvarnames = c("document", "language", "input_encoding"))
print(txts, n = 50)
## readtext object consisting of 36 documents and 3 docvars.
## $text
##  [1] "# A data frame: 36 × 5"                                                                  
##  [2] "   doc_id                             text      document language input_encoding"        
##  [3] "   <chr>                              <chr>     <chr>    <chr>    <chr>         "        
##  [4] " 1 IndianTreaty_English_UTF-16LE.txt  \"\\\"WHERE… IndianT… English  UTF-16LE      "     
##  [5] " 2 IndianTreaty_English_UTF-8-BOM.txt \"\\\"ARTIC… IndianT… English  UTF-8-BOM     "     
##  [6] " 3 UDHR_Arabic_ISO-8859-6.txt         \"\\\"الديب… UDHR     Arabic   ISO-8859-6    "     
##  [7] " 4 UDHR_Arabic_UTF-8.txt              \"\\\"الديب… UDHR     Arabic   UTF-8         "     
##  [8] " 5 UDHR_Arabic_WINDOWS-1256.txt       \"\\\"الديب… UDHR     Arabic   WINDOWS-1256  "     
##  [9] " 6 UDHR_Chinese_GB2312.txt            \"\\\"世界人权宣… UDHR     Chinese  GB2312        "
## [10] " 7 UDHR_Chinese_GBK.txt               \"\\\"世界人权宣… UDHR     Chinese  GBK           "
## [11] " 8 UDHR_Chinese_UTF-8.txt             \"\\\"世界人权宣… UDHR     Chinese  UTF-8         "
## [12] " 9 UDHR_English_UTF-16BE.txt          \"\\\"Unive… UDHR     English  UTF-16BE      "     
## [13] "10 UDHR_English_UTF-16LE.txt          \"\\\"Unive… UDHR     English  UTF-16LE      "     
## [14] "11 UDHR_English_UTF-8.txt             \"\\\"Unive… UDHR     English  UTF-8         "     
## [15] "12 UDHR_English_WINDOWS-1252.txt      \"\\\"Unive… UDHR     English  WINDOWS-1252  "     
## [16] "13 UDHR_French_ISO-8859-1.txt         \"\\\"Décla… UDHR     French   ISO-8859-1    "     
## [17] "14 UDHR_French_UTF-8.txt              \"\\\"Décla… UDHR     French   UTF-8         "     
## [18] "15 UDHR_French_WINDOWS-1252.txt       \"\\\"Décla… UDHR     French   WINDOWS-1252  "     
## [19] "16 UDHR_German_ISO-8859-1.txt         \"\\\"Die A… UDHR     German   ISO-8859-1    "     
## [20] "17 UDHR_German_UTF-8.txt              \"\\\"Die A… UDHR     German   UTF-8         "     
## [21] "18 UDHR_German_WINDOWS-1252.txt       \"\\\"Die A… UDHR     German   WINDOWS-1252  "     
## [22] "19 UDHR_Greek_CP1253.txt              \"\\\"ΟΙΚΟΥ… UDHR     Greek    CP1253        "     
## [23] "20 UDHR_Greek_ISO-8859-7.txt          \"\\\"ΟΙΚΟΥ… UDHR     Greek    ISO-8859-7    "     
## [24] "21 UDHR_Greek_UTF-8.txt               \"\\\"ΟΙΚΟΥ… UDHR     Greek    UTF-8         "     
## [25] "22 UDHR_Hindi_UTF-8.txt               \"\\\"मानव अ… UDHR     Hindi    UTF-8         "    
## [26] "23 UDHR_Icelandic_ISO-8859-1.txt      \"\\\"Mannr… UDHR     Iceland… ISO-8859-1    "     
## [27] "24 UDHR_Icelandic_UTF-8.txt           \"\\\"Mannr… UDHR     Iceland… UTF-8         "     
## [28] "25 UDHR_Icelandic_WINDOWS-1252.txt    \"\\\"Mannr… UDHR     Iceland… WINDOWS-1252  "     
## [29] "26 UDHR_Japanese_CP932.txt            \"\\\"『世界人権… UDHR     Japanese CP932         "
## [30] "27 UDHR_Japanese_ISO-2022-JP.txt      \"\\\"『世界人権… UDHR     Japanese ISO-2022-JP   "
## [31] "28 UDHR_Japanese_UTF-8.txt            \"\\\"『世界人権… UDHR     Japanese UTF-8         "
## [32] "29 UDHR_Japanese_WINDOWS-936.txt      \"\\\"『世界人権… UDHR     Japanese WINDOWS-936   "
## [33] "30 UDHR_Korean_ISO-2022-KR.txt        \"\\\"세 계 인… UDHR     Korean   ISO-2022-KR   "  
## [34] "31 UDHR_Korean_UTF-8.txt              \"\\\"세 계 인… UDHR     Korean   UTF-8         "  
## [35] "32 UDHR_Russian_ISO-8859-5.txt        \"\\\"Всеоб… UDHR     Russian  ISO-8859-5    "     
## [36] "33 UDHR_Russian_KOI8-R.txt            \"\\\"Всеоб… UDHR     Russian  KOI8-R        "     
## [37] "34 UDHR_Russian_UTF-8.txt             \"\\\"Всеоб… UDHR     Russian  UTF-8         "     
## [38] "35 UDHR_Russian_WINDOWS-1251.txt      \"\\\"Всеоб… UDHR     Russian  WINDOWS-1251  "     
## [39] "36 UDHR_Thai_UTF-8.txt                \"\\\"ปฏิญญา… UDHR     Thai     UTF-8         "     
## 
## $summary
## $summary[[1]]
## NULL
## 
## 
## attr(,"class")
## [1] "trunc_mat"
From this object we can easily create a quanteda
corpus object.
if (require("quanteda")) {
    corpus_txts <- corpus(txts)
    summary(corpus_txts, 5)
}
## Corpus consisting of 36 documents, showing 5 documents:
## 
##                                Text Types Tokens Sentences     document
##   IndianTreaty_English_UTF-16LE.txt   619   2578       152 IndianTreaty
##  IndianTreaty_English_UTF-8-BOM.txt   646   3090       150 IndianTreaty
##          UDHR_Arabic_ISO-8859-6.txt   753   1555        86         UDHR
##               UDHR_Arabic_UTF-8.txt   753   1555        86         UDHR
##        UDHR_Arabic_WINDOWS-1256.txt   753   1555        86         UDHR
##  language input_encoding
##   English       UTF-16LE
##   English      UTF-8-BOM
##    Arabic     ISO-8859-6
##    Arabic          UTF-8
##    Arabic   WINDOWS-1256