split 1 column input into 5 column bed file
I am trying to split the single column ($1) of each line in f1 (below) to create a five-column bed file. The rules depend on the type of variant:

If the line is a SNP (lines 1 and 2), split on the : so that the text before it is $1, the position minus 1 is $2, the position is $3, the letter to the left of the > is $4, and the letter to the right is $5.
If the line has dup (line 4), split on the : so that the text is $1, the position is both $2 and $3, $4 is -, and the letter after dup is $5.
If the line has del with nothing else (line 5), split on the : so that the text is $1, the position is both $2 and $3, the letter after del is $4, and $5 is -.
I am not sure how to format line 3 (the complex indel)... maybe split on the : so that the text is $1, the two positions are $2 and $3, the letters after ins are $4, and the letters after del are $5. Thank you :)

f1

    chr7:140453145A>T
    chr7:140453136A>T
    chr7:140453135_140453136delCAinsTT
    chr20:31022287dupA
    chr19:13054614delG

desired

    chr7 140453144 140453145 A T
    chr7 140453135 140453136 A T
    chr7 140453135 140453136 TT CA
    chr20 31022287 31022287 - A
    chr19 13054614 13054614 G -

awk

    awk 'BEGIN{OFS="\t"} {sub(/[^0-9]+$/, "", $1)}' f1
    awk 'BEGIN { FS = "[ -]"; OFS="\t" } NF==3 { print $1, $2 - 1, $2, $3 } NF==4 { print $1, $2, $3, $4 }' f1
asked 9 hours ago (edited 9 hours ago)
justaguy
Do you have to do this in awk for some reason? Sure, it's possible, but there are enough cases that it'd make more sense to use Python, which is easier to write longer scripts in.
– Devon Ryan♦ 9 hours ago
Python is fine, I just don't know it as well... only the basics, but I am interested to see what it would look like, as I can learn from it. Thank you :).
– justaguy 8 hours ago
3 Answers
The following Python script seems to do the job.

    #!/usr/bin/env python3
    from __future__ import print_function
    import re
    import sys

    for line in sys.stdin:
        if '_' in line:
            # complex indel, e.g. chr7:140453135_140453136delCAinsTT
            match = re.search(r'(\S+):(\d+)_(\d+)del([ACGT]+)ins([ACGT]+)', line)
            print(match.group(1), match.group(2), match.group(3), match.group(5), match.group(4), sep='\t')
        elif 'dup' in line or 'del' in line:
            # simple deletion or duplication: start and end are the same position
            match = re.search(r'(\S+):(\d+)(del|dup)([ACGT]+)', line)
            if match.group(3) == 'del':
                print(match.group(1), match.group(2), match.group(2), match.group(4), '-', sep='\t')
            else:
                print(match.group(1), match.group(2), match.group(2), '-', match.group(4), sep='\t')
        else:
            # SNP: start is the position minus one
            match = re.search(r'(\S+):(\d+)([ACGT]+)>([ACGT]+)', line)
            coord = int(match.group(2))
            print(match.group(1), coord - 1, coord, match.group(3), match.group(4), sep='\t')

Invoked like so on the command line.

    [standage@lappy ~] $ ./transform < f1
    chr7 140453144 140453145 A T
    chr7 140453135 140453136 A T
    chr7 140453135 140453136 TT CA
    chr20 31022287 31022287 - A
    chr19 13054614 13054614 G -
answered 8 hours ago (edited 7 hours ago)
Daniel Standage
Thank you very much:

    print(match.group(1), match.group(2), match.group(3), match.group(5), match.group(4), sep='\t')
    ^ SyntaxError: invalid syntax
    ~$ python --version
    Python 2.7.9

(not sure if this helps, sorry, new to Python. :)
– justaguy 7 hours ago
It looks like you're running Python 2, which unfortunately is the default version on many operating systems even though it's at its end-of-life. I'll update the script so that it's compatible with Python 2.
– Daniel Standage 7 hours ago
"print" was a statement in Python 2 but is a function in Python 3, hence the syntax error.
– haci 7 hours ago
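The print difference noted in the comments can also be sidestepped entirely. The sketch below is an assumption-laden alternative, not what the answer does (the answer relies on `from __future__ import print_function`): it builds each output line with `str.join` and writes it with `sys.stdout.write`, both of which behave the same in Python 2 and Python 3. The helper name `tsv_line` is hypothetical.

```python
import sys


def tsv_line(*fields):
    """Join the fields with tabs and terminate the line with a newline."""
    return "\t".join(str(field) for field in fields) + "\n"


# Writes the same text that print(..., sep='\t') produces in Python 3,
# using syntax that Python 2 also accepts.
sys.stdout.write(tsv_line("chr7", 140453144, 140453145, "A", "T"))
```

Integer coordinates are converted with `str` before joining, so the SNP branch can pass `coord - 1` directly.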
The Python script in my first response is heavy on regex matching, which is pretty clunky in Python. I like Python much better than Perl overall, but a throwaway script like this will be clearer and more concise in Perl.

    #!/usr/bin/env perl
    use strict;

    while (<STDIN>) {
        # Complex indel (same pattern as the Python script):
        # columns 4 and 5 are the inserted and deleted bases.
        if (m/(\S+):(\d+)_(\d+)del([ACGT]+)ins([ACGT]+)/) {
            print("$1\t$2\t$3\t$5\t$4\n");
        } elsif (m/(\S+):(\d+)(del|dup)([ACGT]+)/) {
            if ($3 eq "del") {
                print("$1\t$2\t$2\t$4\t-\n");
            } else {
                print("$1\t$2\t$2\t-\t$4\n");
            }
        } elsif (m/(\S+):(\d+)([ACGT]+)>([ACGT]+)/) {
            my $coord2 = $2;
            my $coord1 = $2 - 1;
            print("$1\t$coord1\t$coord2\t$3\t$4\n")
        }
    }

It's been several years since I wrote Perl on a regular basis, so there are probably ways to make the script even clearer and more concise. (Although I have a love/hate relationship with the default/hidden variables, which can make a script more concise but also harder for a newcomer to understand.)

Invoked like so on the command line.

    [standage@lappy ~] $ ./transform < f1
    chr7 140453144 140453145 A T
    chr7 140453135 140453136 A T
    chr7 140453135 140453136 TT CA
    chr20 31022287 31022287 - A
    chr19 13054614 13054614 G -
answered 7 hours ago
Daniel Standage
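For comparison, some of the regex clunkiness in Python can be reduced with named groups and a list of compiled patterns. This is only a sketch and not part of either answer: the names `PATTERNS`, `to_bed` and `convert` are hypothetical, the patterns are anchored to the whole line (an assumption the original scripts do not make), and unrecognized lines are skipped rather than raising an error.

```python
import re

# Patterns are tried in order. Each is anchored to the whole line.
PATTERNS = [
    ("indel", re.compile(
        r"^(?P<chrom>\S+):(?P<start>\d+)_(?P<end>\d+)"
        r"del(?P<deleted>[ACGT]+)ins(?P<inserted>[ACGT]+)$")),
    ("del", re.compile(r"^(?P<chrom>\S+):(?P<pos>\d+)del(?P<seq>[ACGT]+)$")),
    ("dup", re.compile(r"^(?P<chrom>\S+):(?P<pos>\d+)dup(?P<seq>[ACGT]+)$")),
    ("snp", re.compile(
        r"^(?P<chrom>\S+):(?P<pos>\d+)(?P<ref>[ACGT]+)>(?P<alt>[ACGT]+)$")),
]


def to_bed(line):
    """Return the five output columns for one line, or None if unrecognized."""
    text = line.strip()
    for kind, pattern in PATTERNS:
        m = pattern.match(text)
        if m is None:
            continue
        g = m.groupdict()
        if kind == "indel":
            # Inserted bases go in column 4, deleted bases in column 5,
            # matching the desired output in the question.
            return (g["chrom"], g["start"], g["end"], g["inserted"], g["deleted"])
        if kind == "del":
            return (g["chrom"], g["pos"], g["pos"], g["seq"], "-")
        if kind == "dup":
            return (g["chrom"], g["pos"], g["pos"], "-", g["seq"])
        # SNP: start is the position minus one.
        return (g["chrom"], str(int(g["pos"]) - 1), g["pos"], g["ref"], g["alt"])
    return None


def convert(lines):
    """Convert lines to tab-separated strings, skipping unrecognized ones."""
    rows = (to_bed(line) for line in lines)
    return ["\t".join(row) for row in rows if row is not None]
```

Passing an open file or `sys.stdin` to `convert` and printing each returned string reproduces the five desired lines for f1.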
Here is an R solution, though it would probably be slower than the suggested Python solution:

    library(stringr)   # str_extract()
    library(magrittr)  # %>%

    bed <- readLines("bed.txt")
    res <- data.frame()
    for (i in bed) {
      temp <- strsplit(i, ":") %>% unlist()
      # case N>N
      if (grepl(">", temp[[2]])) {
        begin <- str_extract(temp[[2]], pattern = "^[:digit:]+") %>% as.integer() - 1
        end <- begin + 1
        ref <- str_extract(temp[[2]], pattern = "[A-Z]")
        alt <- str_extract(temp[[2]], pattern = "[A-Z]$")
        temp_res <- data.frame(temp[[1]], begin, end, ref, alt)
        res <- rbind(res, temp_res)
      }
      # case del_ins
      if (grepl("ins", temp[[2]])) {
        begin <- str_extract(temp[[2]], pattern = "^[:digit:]+") %>% as.integer()
        end <- str_extract(temp[[2]], pattern = "_[:digit:]+")
        end <- substring(end, 2, nchar(end)) %>% as.integer()
        ref <- str_extract(temp[[2]], pattern = "[A-Z]+$")
        alt <- str_extract(temp[[2]], pattern = "[A-Z]+")
        temp_res <- data.frame(temp[[1]], begin, end, ref, alt)
        res <- rbind(res, temp_res)
      }
      # case dup
      if (grepl("dup", temp[[2]])) {
        begin <- str_extract(temp[[2]], pattern = "^[:digit:]+") %>% as.integer()
        end <- begin
        ref <- "-"
        alt <- str_extract(temp[[2]], pattern = "[A-Z]$")
        temp_res <- data.frame(temp[[1]], begin, end, ref, alt)
        res <- rbind(res, temp_res)
      }
      # case del only
      if (grepl("del", temp[[2]]) & !grepl("ins", temp[[2]])) {
        begin <- str_extract(temp[[2]], pattern = "^[:digit:]+") %>% as.integer()
        end <- begin
        ref <- str_extract(temp[[2]], pattern = "[A-Z]$")
        alt <- "-"
        temp_res <- data.frame(temp[[1]], begin, end, ref, alt)
        res <- rbind(res, temp_res)
      }
    }
    print(res)

      temp..1..     begin       end ref alt
    1      chr7 140453144 140453145   A   T
    2      chr7 140453135 140453136   A   T
    3      chr7 140453135 140453136  TT  CA
    4     chr20  31022287  31022287   -   A
    5     chr19  13054614  13054614   G   -
Thank you all very much :).
– justaguy 1 hour ago
answered 7 hours ago
hacihaci
603 reputation, 1 silver badge, 11 bronze badges
$begingroup$
Do you have to do this in awk for some reason? Sure, it's possible, but there are enough cases that it'd make more sense to use Python, which is easier to write longer scripts in.
$endgroup$
– Devon Ryan♦
9 hours ago
$begingroup$
Python is fine; I just don't know it as well, only the basics, but I am interested to see what it would look like, as I can learn from it. Thank you :).
$endgroup$
– justaguy
8 hours ago
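Since the comments ask what a Python version might look like, here is a minimal sketch of the same four cases handled by the R answer (substitution, deletion-insertion, duplication, deletion). It is not the Python solution referred to in the thread. The function name `to_bed`, the regular expressions, and the example strings are all assumptions reverse-engineered from the R code and its printed output, so adjust them to the real contents of f1.

```python
import re

# Assumed input shapes (inferred from the R answer, not confirmed by the question):
#   chr7:140453145A>T                    single-base substitution
#   chr7:140453135_140453136delTTinsCA   deletion-insertion
#   chr20:31022287dupA                   duplication
#   chr19:13054614delG                   deletion
SNV = re.compile(r"^(\d+)([A-Z])>([A-Z])$")
DELINS = re.compile(r"^(\d+)_(\d+)del([A-Z]+)ins([A-Z]+)$")
DUP = re.compile(r"^(\d+)dup([A-Z]+)$")
DEL = re.compile(r"^(\d+)del([A-Z]+)$")


def to_bed(line):
    """Turn one 'chr:change' string into a (chrom, start, end, ref, alt) tuple."""
    chrom, _, change = line.strip().partition(":")

    m = SNV.match(change)
    if m:
        pos = int(m.group(1))
        # substitution: start is position - 1, end is the position itself
        return (chrom, pos - 1, pos, m.group(2), m.group(3))

    m = DELINS.match(change)
    if m:
        # deletion-insertion: both coordinates are taken as written
        return (chrom, int(m.group(1)), int(m.group(2)), m.group(3), m.group(4))

    m = DUP.match(change)
    if m:
        pos = int(m.group(1))
        return (chrom, pos, pos, "-", m.group(2))

    m = DEL.match(change)
    if m:
        pos = int(m.group(1))
        return (chrom, pos, pos, m.group(2), "-")

    # fail loudly instead of silently skipping an unexpected format
    raise ValueError(f"unrecognised variant: {line!r}")


# Hypothetical example lines, printed as tab-separated BED-like rows.
examples = [
    "chr7:140453145A>T",
    "chr7:140453135_140453136delTTinsCA",
    "chr20:31022287dupA",
    "chr19:13054614delG",
]
for variant in examples:
    print("\t".join(str(field) for field in to_bed(variant)))
```

To run it on a real file, replace the `examples` list with the lines read from that file. Raising on unmatched lines is a deliberate choice here: the R loop skips anything that matches none of its `if` branches, which can hide malformed input.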