
Type: Package
Title: Cluster Strings by Edit-Distance
Version: 1.0
Author: Dan S. Reznik
Maintainer: Dan S. Reznik <dreznik@gmail.com>
Description: Returns an edit-distance-based clustering of an input vector of strings. Each cluster contains a set of strings with small mutual edit distance (e.g., Levenshtein, optimal string alignment, Damerau-Levenshtein), as computed by stringdist::stringdist(). The set of all mutual edit distances is then used by graph algorithms (from package 'igraph') to single out subsets of high connectivity.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Imports: magrittr, dplyr, stringi, stringr, stringdist, igraph, assertthat, forcats, rlang, tidygraph, ggraph, ggplot2
Depends: R (≥ 3.1)
RoxygenNote: 6.1.1
NeedsCompilation: no
Packaged: 2019-03-26 18:10:58 UTC; dreznik
Repository: CRAN
Date/Publication: 2019-03-30 16:10:03 UTC
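
The Description field above outlines a two-step idea: compute mutual edit distances, then let a graph algorithm find well-connected subsets. The following is a minimal sketch of that idea using only 'stringdist' and 'igraph' directly. It is not the package's own implementation; the thresholding step and the variable names (d, adj, memb) are assumptions made for illustration.

```r
library(stringdist)
library(igraph)

s_vec <- c("alcool", "alcohol", "alcoholic", "brandy", "brandie", "cachaça")
max_dist <- 3

# Pairwise Levenshtein distances between all strings
d <- stringdist::stringdistmatrix(s_vec, s_vec, method = "lv")

# Assumed linking rule: two distinct strings are connected
# when their distance is at most max_dist
adj <- (d <= max_dist) * 1
diag(adj) <- 0
dimnames(adj) <- list(s_vec, s_vec)

g <- igraph::graph_from_adjacency_matrix(adj, mode = "undirected")

# Connected components of the thresholded graph act as clusters
memb <- igraph::components(g)$membership
clusters <- split(names(memb), memb)
print(clusters)
```

With these inputs the "alcool" variants fall into one component, the "brandy" variants into another, and "cachaça" stays on its own.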

Plot string clusters as a graph.

Description

Plot string clusters as a graph.

Usage

cluster_plot(cluster, min_cluster_size = 2, label_size = 2.5,
  repel = T)

Arguments

cluster

string clusters returned from 'cluster_strings()'

min_cluster_size

minimum size for clusters to be plotted.

label_size

font size of the cluster name labels.

repel

whether to "repel" the labels so that cluster names do not overlap.

Value

a graph plot (using 'ggraph') of the string clusters.

Examples

s_vec <- c("alcool", "alcohol", "alcoholic", "brandy", "brandie", "cachaça")
s_clust <- cluster_strings(s_vec, method = "lv", max_dist = 3, algo = "cc")
cluster_plot(s_clust, min_cluster_size = 1)
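
To make the roles of min_cluster_size, label_size and repel more concrete, here is a rough sketch of how a comparable plot could be assembled directly with 'igraph', 'tidygraph' and 'ggraph'. It does not reproduce the internals of cluster_plot(); the edge list, the "fr" layout and the helper names (edges, nodes, g_big) are assumptions chosen for illustration.

```r
library(igraph)
library(tidygraph)
library(ggraph)
library(ggplot2)

# Hypothetical cluster graph: edges link strings assumed to be related
edges <- data.frame(
  from = c("alcool", "alcohol", "brandy"),
  to   = c("alcohol", "alcoholic", "brandie")
)
nodes <- data.frame(
  name = c("alcool", "alcohol", "alcoholic", "brandy", "brandie", "cachaça")
)
g <- igraph::graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

# Keep only vertices whose component has at least min_cluster_size members
min_cluster_size <- 2
comp <- igraph::components(g)
keep <- comp$csize[comp$membership] >= min_cluster_size
g_big <- igraph::induced_subgraph(g, which(keep))

# Draw the graph; size and repel play the roles of label_size and repel
p <- ggraph::ggraph(tidygraph::as_tbl_graph(g_big), layout = "fr") +
  ggraph::geom_edge_link(colour = "grey60") +
  ggraph::geom_node_point() +
  ggraph::geom_node_text(ggplot2::aes(label = name), size = 2.5, repel = TRUE)
print(p)
```

Here the singleton "cachaça" is dropped because its component is smaller than min_cluster_size, which leaves five labelled nodes in the plot.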

Cluster Strings by Edit-Distance

Description

Cluster Strings by Edit-Distance

Usage

cluster_strings(s_vec, clean = T, method = "osa", max_dist = 3,
  algo = "cc")

Arguments

s_vec

a vector of character strings

clean

whether to squish whitespace in s_vec and remove duplicate strings

method

one of "osa" (optimal string alignment), "lv" (Levenshtein), "dl" (Damerau-Levenshtein), as in 'stringdist'

max_dist

maximum edit distance (typically Damerau-Levenshtein) between related strings.

algo

one of "cc" (connected components) or "eb" (edge betweenness)

Value

an object whose 'df_clusters' element (see Examples) is a data frame containing cluster membership for each input string

Examples

s_vec <- c("alcool", "alcohol", "alcoholic", "brandy", "brandie", "cachaça")
s_clust <- cluster_strings(s_vec, method = "lv", max_dist = 3, algo = "cc")
s_clust$df_clusters
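
The method and algo arguments are easier to choose once their effect is seen on small inputs. The sketch below calls 'stringdist' and 'igraph' directly rather than cluster_strings(). It assumes that "cc" corresponds to igraph::components() and "eb" to igraph::cluster_edge_betweenness(), which this manual does not state explicitly, and the toy graph is made up for illustration.

```r
library(stringdist)
library(igraph)

methods <- c("osa", "lv", "dl")

# One adjacent transposition: counted as 1 edit by osa and dl, 2 by lv
d_swap <- sapply(methods, function(m) {
  stringdist::stringdist("brandy", "brnady", method = m)
})
print(d_swap)

# A case where osa and dl disagree: dl allows further edits
# on a substring that was already transposed, osa does not
d_ca <- sapply(methods, function(m) {
  stringdist::stringdist("ca", "abc", method = m)
})
print(d_ca)

# Toy graph: two triangles joined by a single bridge edge (c - d)
g <- igraph::graph_from_literal(a - b, b - c, a - c, c - d, d - e, e - f, d - f)

# Connected components treat everything reachable as one cluster
n_cc <- igraph::components(g)$no

# Edge betweenness community detection can split at the bridge
eb <- igraph::cluster_edge_betweenness(g)
n_eb <- length(unique(as.vector(igraph::membership(eb))))

print(c(cc = n_cc, eb = n_eb))
```

Under these assumptions, "cc" merges any strings linked by a chain of close neighbours, while "eb" can separate two dense groups that are joined only by a few weak links.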

Distinct words in Cervantes' "Don Quijote".

Description

Data frame listing all distinct words (length > 3), their length, and their frequency of appearance in the text.

Usage

quijote_words

Format

A data frame with about 22,000 rows and 3 columns:

word

the unique word, in Spanish

len

the word's length

freq

number of appearances in the text

Source

http://www.gutenberg.org/cache/epub/2000/pg2000.txt
