Split 1 column input into 5 column bed file

In f1 below I am trying to split $1 on each line to create a bed file. If the line is a SNP, then it is split on the : so that the text before it is $1, the number minus 1 is $2, the number itself is $3, the letter to the left of the > is $4 and the letter to the right is $5 (lines 1 and 2).

If the line has dup, then it is split on the : so that the text is $1, the number is both $2 and $3, $4 is - and the letter after dup is $5 (line 4).

If the line has del with nothing else, then it is split on the : so that the text is $1, the number is both $2 and $3, the letter after del is $4 and $5 is - (line 5).

I am not sure how to format line 3 (the complex indel)... maybe split $1 on the : so that the text is $1, the two numbers are $2 and $3, and the letters after del and ins fill $4 and $5. Thank you :)

chr7:140453145A>T
chr7:140453136A>T 
chr7:140453135_140453136delCAinsTT
chr20:31022287dupA
chr19:13054614delG

desired

chr7 140453144 140453145 A T 
chr7 140453135 140453136 A T 
chr7 140453135 140453136 TT CA
chr20 31022287 31022287 - A
chr19 13054614 13054614 G -

awk

awk 'BEGIN {OFS="\t"} {sub(/[^0-9]+$/, "", $1)}' f1

awk 'BEGIN {FS = "[ -]"; OFS="\t"} NF==3 {print $1, $2 - 1, $2, $3} NF==4 {print $1, $2, $3, $4}' f1

asked 9 hours ago by justaguy (edited 9 hours ago)

Do you have to do this in awk for some reason? Sure, it's possible, but there are enough cases that it would make more sense to use Python, which is easier to write longer scripts in.
– Devon Ryan♦ 9 hours ago

Python is fine, I just don't know it as well... only the basics, but I am interested to see what it would look like, as I can learn from it. Thank you :).
– justaguy 8 hours ago

3 Answers

The following Python script seems to do the job.

#!/usr/bin/env python3

from __future__ import print_function
import re
import sys

for line in sys.stdin:
    if '_' in line:
        # complex indel, e.g. chr7:140453135_140453136delCAinsTT
        match = re.search(r'(\S+):(\d+)_(\d+)del([ACGT]+)ins([ACGT]+)', line)
        print(match.group(1), match.group(2), match.group(3), match.group(5), match.group(4), sep='\t')
    elif 'dup' in line or 'del' in line:
        # simple deletion or duplication, e.g. chr19:13054614delG
        match = re.search(r'(\S+):(\d+)(del|dup)([ACGT]+)', line)
        if match.group(3) == 'del':
            print(match.group(1), match.group(2), match.group(2), match.group(4), '-', sep='\t')
        else:
            print(match.group(1), match.group(2), match.group(2), '-', match.group(4), sep='\t')
    else:
        # SNP, e.g. chr7:140453145A>T; the start is the position minus 1
        match = re.search(r'(\S+):(\d+)([ACGT]+)>([ACGT]+)', line)
        coord = int(match.group(2))
        print(match.group(1), coord - 1, coord, match.group(3), match.group(4), sep='\t')

Invoked like so on the command line.

[standage@lappy ~] $ ./transform < f1
chr7 140453144 140453145 A T
chr7 140453135 140453136 A T
chr7 140453135 140453136 TT CA
chr20 31022287 31022287 - A
chr19 13054614 13054614 G -
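
If the input may contain blank or unexpected lines, the same three cases can be folded into a function that either returns the five fields or raises an error, instead of failing on `match.group`. This is only a sketch under assumptions: the name `variant_to_bed`, the anchored patterns and the `ValueError` are illustrative and not part of the script above, and the column order for the complex indel simply mirrors the desired output in the question.

```python
import re

# Patterns mirror the three cases handled by the script above.
DELINS = re.compile(r'^(\S+):(\d+)_(\d+)del([ACGT]+)ins([ACGT]+)$')
DELDUP = re.compile(r'^(\S+):(\d+)(del|dup)([ACGT]+)$')
SNV = re.compile(r'^(\S+):(\d+)([ACGT]+)>([ACGT]+)$')


def variant_to_bed(line):
    """Return (chrom, start, end, col4, col5) for one variant string.

    Hypothetical helper: raises ValueError when the line matches none
    of the three cases.
    """
    text = line.strip()  # tolerate trailing spaces and newlines
    m = DELINS.match(text)
    if m:
        chrom, start, end, deleted, inserted = m.groups()
        # Inserted sequence first, as in the question's desired output.
        return (chrom, int(start), int(end), inserted, deleted)
    m = DELDUP.match(text)
    if m:
        chrom, pos, kind, seq = m.groups()
        if kind == 'del':
            return (chrom, int(pos), int(pos), seq, '-')
        return (chrom, int(pos), int(pos), '-', seq)
    m = SNV.match(text)
    if m:
        chrom, pos, ref, alt = m.groups()
        # Only the SNP case shifts the start coordinate by one.
        return (chrom, int(pos) - 1, int(pos), ref, alt)
    raise ValueError('unrecognised variant: %r' % text)
```

Each returned tuple can be turned into an output line with `'\t'.join(str(field) for field in variant_to_bed(line))`.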

answered 8 hours ago by Daniel Standage (edited 7 hours ago)

Thank you very much:

print(match.group(1), match.group(2), match.group(3), match.group(5), match.group(4), sep='\t')
^
SyntaxError: invalid syntax

~$ python --version
Python 2.7.9

(Not sure if this helps, sorry, new to Python. :)
– justaguy 7 hours ago

It looks like you're running Python 2, which unfortunately is the default version on many operating systems even though it's at its end-of-life. I'll update the script so that it's compatible with Python 2.
– Daniel Standage 7 hours ago

"print" was a statement in Python 2 but is a function in Python 3, hence the syntax error.
– haci 7 hours ago
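
One way to sidestep the `print` difference altogether is to build each line with `str.join` and write it to a stream, which should behave the same under Python 2 and Python 3. The sketch below is an illustration only: the helper name `write_fields` and its `stream` parameter are assumptions, not something taken from the answer.

```python
import sys


def write_fields(fields, stream=sys.stdout):
    """Write the fields as one tab-separated line and return that line.

    Hypothetical helper: it avoids print() entirely, so it does not
    depend on whether print is a statement or a function.
    """
    line = '\t'.join(str(field) for field in fields)
    stream.write(line + '\n')
    return line


# Example call with the first row of the desired output.
write_fields(['chr7', 140453144, 140453145, 'A', 'T'])
```

The script in the answer takes the other route: `from __future__ import print_function` makes the `sep` keyword available on Python 2 as well.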

The Python script in my first response is heavy on regex matching, which is pretty clunky in Python. I like Python much better than Perl overall, but a throwaway script like this will be clearer and more concise in Perl.

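#!/usr/bin/env perl
use strict;

while(<STDIN>) {
    # complex indel (this branch is reconstructed to mirror the Python script above)
    if (m/(\S+):(\d+)_(\d+)del([ACGT]+)ins([ACGT]+)/) {
        print("$1\t$2\t$3\t$5\t$4\n");
    }
    # simple deletion or duplication
    elsif (m/(\S+):(\d+)(del|dup)([ACGT]+)/) {
        if ($3 eq "del") {
            print("$1\t$2\t$2\t$4\t-\n");
        }
        else {
            print("$1\t$2\t$2\t-\t$4\n");
        }
    }
    # SNP: the start is the position minus 1
    elsif (m/(\S+):(\d+)([ACGT]+)>([ACGT]+)/) {
        my $coord2 = $2;
        my $coord1 = $2 - 1;
        print("$1\t$coord1\t$coord2\t$3\t$4\n")
    }
}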
It's been several years since I wrote Perl on a regular basis, so there are probably ways to make the script even more clear and concise. (Although I have a love/hate relationship with the default/hidden variables, which can make a script more concise but also harder for a newcomer to understand.)

Invoked like so on the command line.

[standage@lappy ~] $ ./transform < f1
chr7 140453144 140453145 A T
chr7 140453135 140453136 A T
chr7 140453135 140453136 TT CA
chr20 31022287 31022287 - A
chr19 13054614 13054614 G -

answered 7 hours ago by Daniel Standage

Here is an R solution, though it would probably be slower than the suggested Python solution:

library(stringr)   # str_extract()
library(magrittr)  # %>%

bed <- readLines("bed.txt")

res <- data.frame()

for (i in bed) {

  # temp[[1]] is the chromosome, temp[[2]] is everything after the ":"
  temp <- strsplit(i, ":") %>% unlist()

  # case N>N
  if (grepl(">", temp[[2]])) {
    begin <- str_extract(temp[[2]], pattern = "^[:digit:]+") %>% as.integer() - 1
    end <- begin + 1
    ref <- str_extract(temp[[2]], pattern = "[A-Z]")
    alt <- str_extract(temp[[2]], pattern = "[A-Z]$")

    temp_res <- data.frame(temp[[1]], begin, end, ref, alt)
    res <- rbind(res, temp_res)
  }

  # case del_ins
  if (grepl("ins", temp[[2]])) {
    begin <- str_extract(temp[[2]], pattern = "^[:digit:]+") %>% as.integer()
    end <- str_extract(temp[[2]], pattern = "_[:digit:]+")
    end <- substring(end, 2, nchar(end)) %>% as.integer()
    ref <- str_extract(temp[[2]], pattern = "[A-Z]+$")
    alt <- str_extract(temp[[2]], pattern = "[A-Z]+")

    temp_res <- data.frame(temp[[1]], begin, end, ref, alt)
    res <- rbind(res, temp_res)
  }

  # case dup
  if (grepl("dup", temp[[2]])) {
    begin <- str_extract(temp[[2]], pattern = "^[:digit:]+") %>% as.integer()
    end <- begin
    ref <- "-"
    alt <- str_extract(temp[[2]], pattern = "[A-Z]$")

    temp_res <- data.frame(temp[[1]], begin, end, ref, alt)
    res <- rbind(res, temp_res)
  }

  # case del only
  if (grepl("del", temp[[2]]) & !grepl("ins", temp[[2]])) {
    begin <- str_extract(temp[[2]], pattern = "^[:digit:]+") %>% as.integer()
    end <- begin
    ref <- str_extract(temp[[2]], pattern = "[A-Z]$")
    alt <- "-"

    temp_res <- data.frame(temp[[1]], begin, end, ref, alt)
    res <- rbind(res, temp_res)
  }
}

print(res)

 temp..1.. begin end ref alt
1 chr7 140453144 140453145 A T
2 chr7 140453135 140453136 A T
3 chr7 140453135 140453136 TT CA
4 chr20 31022287 31022287 - A
5 chr19 13054614 13054614 G -

answered 7 hours ago by haci

Thank you all very much :).
– justaguy 1 hour ago
