word frequency from file using partial match
I have a text file like this:
tom
and
jerry
went
to
america
and
england
I want to get the frequency of each word.
When I tried the following command
cat test.txt |sort|uniq -c
I got the following output
1 america
2 and
1 england
1 jerry
1 to
1 tom
1 went
But I need partial matches too, i.e. the word to is present in the word tom, so my expected count for to is 2. Is it possible using unix commands?
text-processing command-line
4 Answers
Here's one way, but it isn't very elegant:
$ sort -u file | while IFS= read -r word; do
    printf '%s\t%s\n' "$word" "$(grep -cFe "$word" file)"
done
america 1
and 3
england 1
jerry 1
to 2
tom 1
went 1
An awk approach:
awk '
  !x {c[$0]; next}
  {for (i in c) if (index($0, i)) c[i]++}
  END {for (i in c) print c[i] "\t" i}' file x=1 file | sort -k1rn
Which on your input gives:
3 and
2 to
1 america
1 england
1 jerry
1 tom
1 went
Thank you, this command works. If I run it against a large file, around 30 GB, will a machine with 8 GB of RAM handle that?
– TweetMan
7 hours ago
@TweetMan It depends on how many unique words there are: it stores all unique words in memory.
– Stéphane Chazelas
7 hours ago
Hmm, then that would be a problem; it may crash the system.
– TweetMan
7 hours ago
Awk isn't safe with large files and it bogs down. You may want to look into loading the data into a SQL database and querying it that way.
– A.Danischewski
4 hours ago
This won't crash the system but it may take a long time to run, since it parses the input multiple times. Assuming the input file is called "in":
sort -u < in | while read w
do
    printf "%d\t%s\n" `grep -c "$w" in` "$w"
done
which on your input got me:
1 america
3 and
1 england
1 jerry
2 to
1 tom
1 went
It's not clear to me if the partial matches have to be anchored to the beginning of the line. Assuming that to be the case, what might speed things up here is the use of binary search via the venerable look command.
Of course, look requires that its input file be sorted, so first create a sorted version of the original file:
sort file > file.sorted
Then loop through the original file, looking up one word at a time against the sorted file:
while read -r word; do
    printf "%s %d\n" "$word" "$(look -b "$word" file.sorted | wc -l)"
done <file
Some systems don't need the -b flag to be passed to look to force a binary search. Disk caching of the sorted file could help speed things up even further.