How to know the operations made to calculate the Levenshtein distance between strings?
With the function stringdist, I can calculate the Levenshtein distance between two strings: it counts the number of deletions, insertions and substitutions necessary to turn one string into another. For instance, stringdist("abc abc", "abcd abc") returns 1, because "d" was inserted in the second string.
Is it possible to know the operations made to obtain the Levenshtein distance between two strings? Or else, to know the characters that differ between the two strings (in this example, only "d")?
Thanks.
library(stringdist)
stringdist("abc abc", "abcde acc")  # 3
I would like to know that:
"d" was inserted
"e" was inserted
"b" was substituted by "c"
Or, more simply, I would like to have the list ("d", "e", "c").
r string levenshtein-distance stringdist
I don't know of any R package that does what you want. Do you really need it or are you asking for pedagogic purposes? In any case, see what Wikipedia has to say on this. And upvote.
– Rui Barradas
8 hours ago
It could help with my research, because I'm trying to identify the differences between strings. Thanks for the link.
– yaki
8 hours ago
@RuiBarradas Not only do such packages exist, their existence is the major reason for R’s popularity today. :-)
– Konrad Rudolph
7 hours ago
3 Answers
With adist(), you can retrieve the operations:
drop(attr(adist("abc abc", "abcde acc", counts = TRUE), "counts"))
ins del sub
  2   0   1
From ?adist:
If counts is TRUE, the transformation counts are returned as the "counts" attribute of this matrix, as a 3-dimensional array with dimensions corresponding to the elements of x, the elements of y, and the type of transformation (insertions, deletions and substitutions), respectively.
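As a small illustration of the two attributes that adist() attaches to its result, here is a sketch using the strings from the question. Both "counts" and "trafos" are described in ?adist; the variable names are only illustrative.

```r
x <- "abc abc"
y <- "abcde acc"

d <- adist(x, y, counts = TRUE)

# The Levenshtein distance itself (d is a 1 x 1 matrix here)
distance <- d[1, 1]

# Named vector with the number of insertions, deletions and substitutions
counts <- drop(attr(d, "counts"))

# One letter per alignment position:
# M = match, I = insert, D = delete, S = substitute
trafos <- attr(d, "trafos")[1, 1]

print(distance)
print(counts)
print(trafos)
```

The three counts always add up to the distance, which is a quick sanity check when experimenting with other strings.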
Thanks, it helps me a lot! Do you know if there is a function to directly get the characters corresponding to these operations? Otherwise, I could try to write a function using attr(adist("abda cc", "abc abc", counts = TRUE), "trafos"), which returns "MMSDMSIM", where M = match, S = substitute, D = delete, I = insert.
– yaki
8 hours ago
I don't know of any handy function that will do it. However, I assume that playing around with trafos will lead you to the desired results.
– tmfmnk
8 hours ago
Building off tmfmnk's answer and the suggestion to play around with the "trafos" attribute, here's a function which will show you a table of all the characters inserted or substituted, and how many times each was inserted or substituted. If you set all_actions = TRUE, it will show you matches as well.
f <- function(x, y, all_actions = FALSE) {
  o <- adist(x, y, counts = TRUE)
  action <- strsplit(attr(o, "trafos"), '')[[1]]
  # A deletion consumes a character of x only, so drop 'D' to keep
  # the actions aligned one-to-one with the characters of y
  action <- action[action != 'D']
  cva <- list(char = strsplit(y, '')[[1]], action = action)
  if (!all_actions)
    cva <- lapply(cva, '[', cva$action %in% c('I', 'S'))
  do.call(table, cva)
}
f(x = "abc abc", y = "abcde acc")
#     action
# char I S
#    c 0 1
#    d 1 0
#    e 1 0
f(x = "abc abc", y = "abcde acc", all_actions = TRUE)
#     action
# char I M S
#      0 1 0
#    a 0 2 0
#    b 0 1 0
#    c 0 2 1
#    d 1 0 0
#    e 1 0 0
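The table above only reports characters of the second string. To get sentences like the ones in the question, including which character of the first string was replaced or deleted, one option is to walk through the "trafos" string while keeping a position in each input. This is only a sketch: edit_ops is a hypothetical helper name, and it assumes a single pair of strings.

```r
edit_ops <- function(x, y) {
  d <- adist(x, y, counts = TRUE)
  ops <- strsplit(attr(d, "trafos")[1, 1], "")[[1]]
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]

  i <- 0L  # characters of x consumed so far
  j <- 0L  # characters of y consumed so far
  out <- character(0)
  for (op in ops) {
    if (op == "I") {
      # Insertion: consumes a character of y only
      j <- j + 1L
      out <- c(out, sprintf('"%s" was inserted', ys[j]))
    } else if (op == "D") {
      # Deletion: consumes a character of x only
      i <- i + 1L
      out <- c(out, sprintf('"%s" was deleted', xs[i]))
    } else {
      # Match or substitution: consumes one character of each string
      i <- i + 1L
      j <- j + 1L
      if (op == "S")
        out <- c(out, sprintf('"%s" was substituted by "%s"', xs[i], ys[j]))
    }
  }
  out
}

print(edit_ops("abc abc", "abcde acc"))
print(edit_ops("abcd", "abc"))
```

For the strings in the question this should report the two insertions and the one substitution that the asker listed, assuming adist() returns the same alignment as in the table above.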
This is known as the Needleman–Wunsch algorithm. It calculates both the distance between two strings as well as the so-called traceback, which allows you to reconstruct the alignment.
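To make the idea of distance plus traceback concrete, here is a base-R sketch of the plain unit-cost Levenshtein dynamic programme followed by a traceback. It is not the Biostrings implementation, and it is not a general Needleman–Wunsch with arbitrary scores; lev_traceback is a hypothetical name, and the operation letters follow the M/I/D/S convention of adist().

```r
lev_traceback <- function(x, y) {
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  n <- length(xs)
  m <- length(ys)

  # D[i + 1, j + 1] = distance between the first i characters of x
  # and the first j characters of y
  D <- matrix(0L, n + 1L, m + 1L)
  D[, 1] <- 0:n
  D[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(xs[i] != ys[j])
      D[i + 1L, j + 1L] <- min(D[i, j] + cost,     # match or substitute
                               D[i, j + 1L] + 1L,  # delete xs[i]
                               D[i + 1L, j] + 1L)  # insert ys[j]
    }
  }

  # Traceback: walk from the bottom-right corner back to the origin,
  # at each step picking a predecessor cell that explains the value
  i <- n
  j <- m
  ops <- character(0)
  while (i > 0L || j > 0L) {
    if (i > 0L && j > 0L &&
        D[i + 1L, j + 1L] == D[i, j] + as.integer(xs[i] != ys[j])) {
      ops <- c(if (xs[i] == ys[j]) "M" else "S", ops)
      i <- i - 1L
      j <- j - 1L
    } else if (i > 0L && D[i + 1L, j + 1L] == D[i, j + 1L] + 1L) {
      ops <- c("D", ops)
      i <- i - 1L
    } else {
      ops <- c("I", ops)
      j <- j - 1L
    }
  }
  list(distance = D[n + 1L, m + 1L], ops = paste(ops, collapse = ""))
}

res <- lev_traceback("abc abc", "abcde acc")
print(res$distance)
print(res$ops)
```

When several alignments have the same cost, the traceback here prefers the diagonal step, so the reported operations may differ from those of another tool even though the distance is the same.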
Since this problem mostly crops up in biology when comparing biological sequences, this algorithm (and related ones) is implemented in the R package Biostrings, which is part of Bioconductor.
Since this package implements a more general solution than the simple Levenshtein distance, the usage is unfortunately more complex, and the usage vignette is correspondingly long. But the fundamental usage for your purposes is as follows:
library(Biostrings)
dist_mat = diag(27L)
colnames(dist_mat) = rownames(dist_mat) = c(letters, ' ')
result = pairwiseAlignment(
    "abc abc", "abcde acc",
    substitutionMatrix = dist_mat,
    gapOpening = 1, gapExtension = 1
)
This won’t simply give you the list c('d', 'e', 'c'), though, because that list does not fully represent what actually happened here. Instead, it will return an alignment between the two strings. This can be represented as a sequence with substitutions and gaps:
score(result)
# [1] 3
aligned(result)
as.matrix(aligned(result))
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] "a"  "b"  "c"  "-"  "-"  " "  "a"  "b"  "c"
For each character in the second string, aligned(result) provides the corresponding character in the original string, replacing inserted characters by -. Basically, this is a “recipe” for transforming the first string into the second string. Note that it will only contain insertions and substitutions, not deletions. To get these, you need to perform the alignment the other way round (i.e. swapping the string arguments).
Unfortunately, the code above requires you to specify dist_mat manually such that it contains one row and column for each character that your string might contain. The code shown in this answer thus only allows lower-case letters and spaces, nothing else.
– Konrad Rudolph
6 hours ago
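One way to avoid hard-coding the alphabet might be to derive it from the input strings. The sketch below uses base R only and builds a matrix of the same shape as dist_mat in the answer; it assumes that all relevant characters occur in the strings being compared, and it does not check whether pairwiseAlignment accepts every such character.

```r
x <- "abc abc"
y <- "abcde acc"

# Every distinct character that occurs in either string
chars <- sort(unique(unlist(strsplit(c(x, y), ""))))

# Identity matrix with the characters as row and column names,
# in the same form as dist_mat in the answer
dist_mat <- diag(length(chars))
dimnames(dist_mat) <- list(chars, chars)

print(dist_mat)
```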