Fastest way to perform complex search on pandas dataframeAdd one row to pandas DataFrameSelecting multiple columns in a pandas dataframeAdding new column to existing DataFrame in Python pandasDelete column from pandas DataFrameHow do I get the row count of a pandas DataFrame?How to iterate over rows in a DataFrame in Pandas?Writing a pandas DataFrame to CSV fileSelect rows from a DataFrame based on values in a column in pandasGet list from pandas DataFrame column headersConvert list of dictionaries to a pandas DataFrame
Preserving culinary oils
What are the slash markings on Gatwick's 08R/26L?
Why does the UK have more political parties than the US?
What does it mean when you think without speaking?
How can I prevent interns from being expendable?
Mother abusing my finances
How did early x86 BIOS programmers manage to program full blown TUIs given very few bytes of ROM/EPROM?
When a current flow in an inductor is interrupted, what limits the voltage rise?
Adding strings in lists together
Why teaching kids Torah is the only forbidden profession for singles Yihud-wise?
What are the problems in teaching guitar via Skype?
If a massive object like Jupiter flew past the Earth how close would it need to come to pull people off of the surface?
Uncommanded roll at high speed
Are UK pensions taxed twice?
Looking after a wayward brother in mother's will
Select row of data if next row contains zero
Lunar orbital rendezvous
What is the difference between nullifying your vote and not going to vote at all?
Different PCB color ( is it different material? )
Improve OR inside INNER JOIN
Why do Russians call their women expensive ("дорогая")?
Could I be denied entry into Ireland due to medical and police situations during a previous UK visit?
Mapping a function f[xi_,xj_] over a list x1, ...., xn with the i < j restriction
Can I install a row of bricks on a slab to support a shed?
Fastest way to perform complex search on pandas dataframe
Add one row to pandas DataFrameSelecting multiple columns in a pandas dataframeAdding new column to existing DataFrame in Python pandasDelete column from pandas DataFrameHow do I get the row count of a pandas DataFrame?How to iterate over rows in a DataFrame in Pandas?Writing a pandas DataFrame to CSV fileSelect rows from a DataFrame based on values in a column in pandasGet list from pandas DataFrame column headersConvert list of dictionaries to a pandas DataFrame
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty height:90px;width:728px;box-sizing:border-box;
I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.
Before:
flightTo flightFrom toNum fromNum toCode fromCode
ABC DEF 123 456 8000 8000
DEF XYZ 456 893 9999 9999
AAA BBB 473 917 5555 5555
BBB CCC 917 341 5555 5555
After search/sort:
flightTo flightFrom toNum fromNum toCode fromCode
ABC XYZ 123 893 8000 9999
AAA CCC 473 341 5555 5555
In this example I am essentially trying to filter out 'flights' that exist in between end destinations. This should be done by using some sort of drop duplicates method but what leaves me confused is how to handle all of the columns. Would a binary search be the best way to accomplish this? Hints appreciated, trying hard to figure this out.
possible edge case:
What if the data is switched up and our end connections are in the same column?
flight1 flight2 1Num 2Num 1Code 2Code
ABC DEF 123 456 8000 8000
XYZ DEF 893 456 9999 9999
After search/sort:
flight1 flight2 1Num 2Num 1Code 2Code
ABC XYZ 123 893 8000 9999
This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ
python pandas binary-search-tree
add a comment |
I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.
Before:
flightTo flightFrom toNum fromNum toCode fromCode
ABC DEF 123 456 8000 8000
DEF XYZ 456 893 9999 9999
AAA BBB 473 917 5555 5555
BBB CCC 917 341 5555 5555
After search/sort:
flightTo flightFrom toNum fromNum toCode fromCode
ABC XYZ 123 893 8000 9999
AAA CCC 473 341 5555 5555
In this example I am essentially trying to filter out 'flights' that exist in between end destinations. This should be done by using some sort of drop duplicates method but what leaves me confused is how to handle all of the columns. Would a binary search be the best way to accomplish this? Hints appreciated, trying hard to figure this out.
possible edge case:
What if the data is switched up and our end connections are in the same column?
flight1 flight2 1Num 2Num 1Code 2Code
ABC DEF 123 456 8000 8000
XYZ DEF 893 456 9999 9999
After search/sort:
flight1 flight2 1Num 2Num 1Code 2Code
ABC XYZ 123 893 8000 9999
This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ
python pandas binary-search-tree
Are the connecting flights always adjacent in the data frame?
– Mike
8 hours ago
np.where(condition)
– Dadu Khan
8 hours ago
how aboutdf['flightFrom'].shift() != df['fightTo']?
– IanS
8 hours ago
@Mike the information can be completely random in the DataFrame
– MaxB
8 hours ago
1
@IanS check the values infromNum, fromCodeexpected output, that's what makes this question complex imo.
– Erfan
8 hours ago
add a comment |
I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.
Before:
flightTo flightFrom toNum fromNum toCode fromCode
ABC DEF 123 456 8000 8000
DEF XYZ 456 893 9999 9999
AAA BBB 473 917 5555 5555
BBB CCC 917 341 5555 5555
After search/sort:
flightTo flightFrom toNum fromNum toCode fromCode
ABC XYZ 123 893 8000 9999
AAA CCC 473 341 5555 5555
In this example I am essentially trying to filter out 'flights' that exist in between end destinations. This should be done by using some sort of drop duplicates method but what leaves me confused is how to handle all of the columns. Would a binary search be the best way to accomplish this? Hints appreciated, trying hard to figure this out.
possible edge case:
What if the data is switched up and our end connections are in the same column?
flight1 flight2 1Num 2Num 1Code 2Code
ABC DEF 123 456 8000 8000
XYZ DEF 893 456 9999 9999
After search/sort:
flight1 flight2 1Num 2Num 1Code 2Code
ABC XYZ 123 893 8000 9999
This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ
python pandas binary-search-tree
I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.
Before:
flightTo flightFrom toNum fromNum toCode fromCode
ABC DEF 123 456 8000 8000
DEF XYZ 456 893 9999 9999
AAA BBB 473 917 5555 5555
BBB CCC 917 341 5555 5555
After search/sort:
flightTo flightFrom toNum fromNum toCode fromCode
ABC XYZ 123 893 8000 9999
AAA CCC 473 341 5555 5555
In this example I am essentially trying to filter out 'flights' that exist in between end destinations. This should be done by using some sort of drop duplicates method but what leaves me confused is how to handle all of the columns. Would a binary search be the best way to accomplish this? Hints appreciated, trying hard to figure this out.
possible edge case:
What if the data is switched up and our end connections are in the same column?
flight1 flight2 1Num 2Num 1Code 2Code
ABC DEF 123 456 8000 8000
XYZ DEF 893 456 9999 9999
After search/sort:
flight1 flight2 1Num 2Num 1Code 2Code
ABC XYZ 123 893 8000 9999
This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ
python pandas binary-search-tree
python pandas binary-search-tree
edited 4 hours ago
MaxB
asked 8 hours ago
MaxBMaxB
1739
1739
Are the connecting flights always adjacent in the data frame?
– Mike
8 hours ago
np.where(condition)
– Dadu Khan
8 hours ago
how aboutdf['flightFrom'].shift() != df['fightTo']?
– IanS
8 hours ago
@Mike the information can be completely random in the DataFrame
– MaxB
8 hours ago
1
@IanS check the values infromNum, fromCodeexpected output, that's what makes this question complex imo.
– Erfan
8 hours ago
add a comment |
Are the connecting flights always adjacent in the data frame?
– Mike
8 hours ago
np.where(condition)
– Dadu Khan
8 hours ago
how aboutdf['flightFrom'].shift() != df['fightTo']?
– IanS
8 hours ago
@Mike the information can be completely random in the DataFrame
– MaxB
8 hours ago
1
@IanS check the values infromNum, fromCodeexpected output, that's what makes this question complex imo.
– Erfan
8 hours ago
Are the connecting flights always adjacent in the data frame?
– Mike
8 hours ago
Are the connecting flights always adjacent in the data frame?
– Mike
8 hours ago
np.where(condition)
– Dadu Khan
8 hours ago
np.where(condition)
– Dadu Khan
8 hours ago
how about
df['flightFrom'].shift() != df['fightTo']?– IanS
8 hours ago
how about
df['flightFrom'].shift() != df['fightTo']?– IanS
8 hours ago
@Mike the information can be completely random in the DataFrame
– MaxB
8 hours ago
@Mike the information can be completely random in the DataFrame
– MaxB
8 hours ago
1
1
@IanS check the values in
fromNum, fromCode expected output, that's what makes this question complex imo.– Erfan
8 hours ago
@IanS check the values in
fromNum, fromCode expected output, that's what makes this question complex imo.– Erfan
8 hours ago
add a comment |
2 Answers
2
active
oldest
votes
This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC
import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')
# create the nx object from pandas dataframe
l=list(nx.connected_components(G))
# then we get the list of components which as tied to each other ,
# in a net work graph , they are linked
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]
# then from the above we can create our map dict ,
# since every components connected to each other ,
# then we just need to pick of of them as key , then map with others
d=k: v for d in L for k, v in d.items()
# create the dict for groupby , since we need _from as first item and _to as last item
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output
Out[22]:
flightTo flightFrom toNum fromNum toCode fromCode
flightTo
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555
Installation networkx
Pip:pip install networkx
Anaconda:conda install -c anaconda networkx
2
great answer! Looked into networkx couple times, will do more now!
– Erfan
8 hours ago
2
@Erfan love the enthusiasm ;) same here(for networkx)
– anky_91
8 hours ago
2
This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)
– Erfan
8 hours ago
1
@Erfan ok let me working on it
– WeNYoBen
8 hours ago
1
Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology
– Prayson W. Daniel
7 hours ago
|
show 5 more comments
Here's a NumPy solution, which might be convenient in the case performance is relevant:
def remove_middle_dest(df):
x = df.to_numpy()
# obtain a flat numpy array from both columns
b = x[:,0:2].ravel()
_, ix, inv = np.unique(b, return_index=True, return_inverse=True)
# Index of duplicate values in b
ixs_drop = np.setdiff1d(np.arange(len(b)), ix)
# Indices to be used to replace the content in the columns
replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0)
# Col index of where duplicate value is, 0 or 1
col = (ixs_drop % 2) ^ 1
# 2d array to index and replace values in the df
# index to obtain values with which to replace
t = np.tile([3,5], len(col)).reshape(len(col), -1)
ixs = np.concatenate([col[:,None], t], 1)
# translate indices to row indices
rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
c = np.empty((len(col), 5), dtype=x.dtype)
c[:,::2] = x[rows_drop[:,None], ixs]
c[:,1::2] = x[rows_replace[:,None], [2,4]]
# update dataframe and drop rows
df.iloc[rows_replace, 1:] = c
return df.drop(rows_drop)
Which fo the proposed dataframe yields the expected output:
print(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
1 DEF XYZ 456 893 9999 9999
2 AAA CCC 473 341 5555 5555
3 BBB CCC 917 341 5555 5555
remove_middle_dest(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555
This approach does not assume any particular order in terms of the rows where the duplicate is, but neither the column on which it is (to cover the edge case described in the question). If we use for instance the following dataframe:
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555
remove_middle_dest(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555
Would this generalize to the case where the flights are randomly distributed over the dataframe?
– Erfan
8 hours ago
Yes,return_indexfromnp.uniquegives the index where a value is repeated. Regardless of its position. @erfan
– yatu
8 hours ago
I think the only problem is//2
– WeNYoBen
7 hours ago
Thanks for pointing out @WeNYoBen, updated the answer!
– yatu
6 hours ago
@yatu I updated the question with a potential edge case.
– MaxB
4 hours ago
|
show 1 more comment
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f56344082%2ffastest-way-to-perform-complex-search-on-pandas-dataframe%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC
import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')
# create the nx object from pandas dataframe
l=list(nx.connected_components(G))
# then we get the list of components which as tied to each other ,
# in a net work graph , they are linked
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]
# then from the above we can create our map dict ,
# since every components connected to each other ,
# then we just need to pick of of them as key , then map with others
d=k: v for d in L for k, v in d.items()
# create the dict for groupby , since we need _from as first item and _to as last item
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output
Out[22]:
flightTo flightFrom toNum fromNum toCode fromCode
flightTo
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555
Installation networkx
Pip:pip install networkx
Anaconda:conda install -c anaconda networkx
2
great answer! Looked into networkx couple times, will do more now!
– Erfan
8 hours ago
2
@Erfan love the enthusiasm ;) same here(for networkx)
– anky_91
8 hours ago
2
This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)
– Erfan
8 hours ago
1
@Erfan ok let me working on it
– WeNYoBen
8 hours ago
1
Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology
– Prayson W. Daniel
7 hours ago
|
show 5 more comments
This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC
import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')
# create the nx object from pandas dataframe
l=list(nx.connected_components(G))
# then we get the list of components which as tied to each other ,
# in a net work graph , they are linked
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]
# then from the above we can create our map dict ,
# since every components connected to each other ,
# then we just need to pick of of them as key , then map with others
d=k: v for d in L for k, v in d.items()
# create the dict for groupby , since we need _from as first item and _to as last item
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output
Out[22]:
flightTo flightFrom toNum fromNum toCode fromCode
flightTo
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555
Installation networkx
Pip:pip install networkx
Anaconda:conda install -c anaconda networkx
2
great answer! Looked into networkx couple times, will do more now!
– Erfan
8 hours ago
2
@Erfan love the enthusiasm ;) same here(for networkx)
– anky_91
8 hours ago
2
This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)
– Erfan
8 hours ago
1
@Erfan ok let me working on it
– WeNYoBen
8 hours ago
1
Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology
– Prayson W. Daniel
7 hours ago
|
show 5 more comments
This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC
import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')
# create the nx object from pandas dataframe
l=list(nx.connected_components(G))
# then we get the list of components which as tied to each other ,
# in a net work graph , they are linked
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]
# then from the above we can create our map dict ,
# since every components connected to each other ,
# then we just need to pick of of them as key , then map with others
d=k: v for d in L for k, v in d.items()
# create the dict for groupby , since we need _from as first item and _to as last item
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output
Out[22]:
flightTo flightFrom toNum fromNum toCode fromCode
flightTo
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555
Installation networkx
Pip:pip install networkx
Anaconda:conda install -c anaconda networkx
This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC
import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')
# create the nx object from pandas dataframe
l=list(nx.connected_components(G))
# then we get the list of components which as tied to each other ,
# in a net work graph , they are linked
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]
# then from the above we can create our map dict ,
# since every components connected to each other ,
# then we just need to pick of of them as key , then map with others
d=k: v for d in L for k, v in d.items()
# create the dict for groupby , since we need _from as first item and _to as last item
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output
Out[22]:
flightTo flightFrom toNum fromNum toCode fromCode
flightTo
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555
Installation networkx
Pip:pip install networkx
Anaconda:conda install -c anaconda networkx
edited 4 hours ago
answered 8 hours ago
WeNYoBenWeNYoBen
136k84575
136k84575
2
great answer! Looked into networkx couple times, will do more now!
– Erfan
8 hours ago
2
@Erfan love the enthusiasm ;) same here(for networkx)
– anky_91
8 hours ago
2
This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)
– Erfan
8 hours ago
1
@Erfan ok let me working on it
– WeNYoBen
8 hours ago
1
Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology
– Prayson W. Daniel
7 hours ago
|
show 5 more comments
2
great answer! Looked into networkx couple times, will do more now!
– Erfan
8 hours ago
2
@Erfan love the enthusiasm ;) same here(for networkx)
– anky_91
8 hours ago
2
This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)
– Erfan
8 hours ago
1
@Erfan ok let me working on it
– WeNYoBen
8 hours ago
1
Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology
– Prayson W. Daniel
7 hours ago
2
2
great answer! Looked into networkx couple times, will do more now!
– Erfan
8 hours ago
great answer! Looked into networkx couple times, will do more now!
– Erfan
8 hours ago
2
2
@Erfan love the enthusiasm ;) same here(for networkx)
– anky_91
8 hours ago
@Erfan love the enthusiasm ;) same here(for networkx)
– anky_91
8 hours ago
2
2
This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)
– Erfan
8 hours ago
This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)
– Erfan
8 hours ago
1
1
@Erfan ok let me working on it
– WeNYoBen
8 hours ago
@Erfan ok let me working on it
– WeNYoBen
8 hours ago
1
1
Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology
– Prayson W. Daniel
7 hours ago
Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology
– Prayson W. Daniel
7 hours ago
|
show 5 more comments
Here's a NumPy solution, which might be convenient in the case performance is relevant:
def remove_middle_dest(df):
x = df.to_numpy()
# obtain a flat numpy array from both columns
b = x[:,0:2].ravel()
_, ix, inv = np.unique(b, return_index=True, return_inverse=True)
# Index of duplicate values in b
ixs_drop = np.setdiff1d(np.arange(len(b)), ix)
# Indices to be used to replace the content in the columns
replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0)
# Col index of where duplicate value is, 0 or 1
col = (ixs_drop % 2) ^ 1
# 2d array to index and replace values in the df
# index to obtain values with which to replace
t = np.tile([3,5], len(col)).reshape(len(col), -1)
ixs = np.concatenate([col[:,None], t], 1)
# translate indices to row indices
rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
c = np.empty((len(col), 5), dtype=x.dtype)
c[:,::2] = x[rows_drop[:,None], ixs]
c[:,1::2] = x[rows_replace[:,None], [2,4]]
# update dataframe and drop rows
df.iloc[rows_replace, 1:] = c
return df.drop(rows_drop)
Which fo the proposed dataframe yields the expected output:
print(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
1 DEF XYZ 456 893 9999 9999
2 AAA CCC 473 341 5555 5555
3 BBB CCC 917 341 5555 5555
remove_middle_dest(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555
This approach does not assume any particular order in terms of the rows where the duplicate is, but neither the column on which it is (to cover the edge case described in the question). If we use for instance the following dataframe:
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555
remove_middle_dest(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555
Would this generalize to the case where the flights are randomly distributed over the dataframe?
– Erfan
8 hours ago
Yes,return_indexfromnp.uniquegives the index where a value is repeated. Regardless of its position. @erfan
– yatu
8 hours ago
I think the only problem is//2
– WeNYoBen
7 hours ago
Thanks for pointing out @WeNYoBen, updated the answer!
– yatu
6 hours ago
@yatu I updated the question with a potential edge case.
– MaxB
4 hours ago
|
show 1 more comment
Here's a NumPy solution, which might be convenient in the case performance is relevant:
def remove_middle_dest(df):
x = df.to_numpy()
# obtain a flat numpy array from both columns
b = x[:,0:2].ravel()
_, ix, inv = np.unique(b, return_index=True, return_inverse=True)
# Index of duplicate values in b
ixs_drop = np.setdiff1d(np.arange(len(b)), ix)
# Indices to be used to replace the content in the columns
replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0)
# Col index of where duplicate value is, 0 or 1
col = (ixs_drop % 2) ^ 1
# 2d array to index and replace values in the df
# index to obtain values with which to replace
t = np.tile([3,5], len(col)).reshape(len(col), -1)
ixs = np.concatenate([col[:,None], t], 1)
# translate indices to row indices
rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
c = np.empty((len(col), 5), dtype=x.dtype)
c[:,::2] = x[rows_drop[:,None], ixs]
c[:,1::2] = x[rows_replace[:,None], [2,4]]
# update dataframe and drop rows
df.iloc[rows_replace, 1:] = c
return df.drop(rows_drop)
Which fo the proposed dataframe yields the expected output:
print(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
1 DEF XYZ 456 893 9999 9999
2 AAA CCC 473 341 5555 5555
3 BBB CCC 917 341 5555 5555
remove_middle_dest(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555
This approach does not assume any particular order in terms of the rows where the duplicate is, but neither the column on which it is (to cover the edge case described in the question). If we use for instance the following dataframe:
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555
remove_middle_dest(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555
Would this generalize to the case where the flights are randomly distributed over the dataframe?
– Erfan
8 hours ago
Yes,return_indexfromnp.uniquegives the index where a value is repeated. Regardless of its position. @erfan
– yatu
8 hours ago
I think the only problem is//2
– WeNYoBen
7 hours ago
Thanks for pointing out @WeNYoBen, updated the answer!
– yatu
6 hours ago
@yatu I updated the question with a potential edge case.
– MaxB
4 hours ago
|
show 1 more comment
Here's a NumPy solution, which might be convenient in the case performance is relevant:
def remove_middle_dest(df):
x = df.to_numpy()
# obtain a flat numpy array from both columns
b = x[:,0:2].ravel()
_, ix, inv = np.unique(b, return_index=True, return_inverse=True)
# Index of duplicate values in b
ixs_drop = np.setdiff1d(np.arange(len(b)), ix)
# Indices to be used to replace the content in the columns
replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0)
# Col index of where duplicate value is, 0 or 1
col = (ixs_drop % 2) ^ 1
# 2d array to index and replace values in the df
# index to obtain values with which to replace
t = np.tile([3,5], len(col)).reshape(len(col), -1)
ixs = np.concatenate([col[:,None], t], 1)
# translate indices to row indices
rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
c = np.empty((len(col), 5), dtype=x.dtype)
c[:,::2] = x[rows_drop[:,None], ixs]
c[:,1::2] = x[rows_replace[:,None], [2,4]]
# update dataframe and drop rows
df.iloc[rows_replace, 1:] = c
return df.drop(rows_drop)
Which fo the proposed dataframe yields the expected output:
print(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
1 DEF XYZ 456 893 9999 9999
2 AAA CCC 473 341 5555 5555
3 BBB CCC 917 341 5555 5555
remove_middle_dest(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555
This approach does not assume any particular order in terms of the rows where the duplicate is, but neither the column on which it is (to cover the edge case described in the question). If we use for instance the following dataframe:
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555
remove_middle_dest(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555
Here's a NumPy solution, which might be convenient in the case performance is relevant:
def remove_middle_dest(df):
x = df.to_numpy()
# obtain a flat numpy array from both columns
b = x[:,0:2].ravel()
_, ix, inv = np.unique(b, return_index=True, return_inverse=True)
# Index of duplicate values in b
ixs_drop = np.setdiff1d(np.arange(len(b)), ix)
# Indices to be used to replace the content in the columns
replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0)
# Col index of where duplicate value is, 0 or 1
col = (ixs_drop % 2) ^ 1
# 2d array to index and replace values in the df
# index to obtain values with which to replace
t = np.tile([3,5], len(col)).reshape(len(col), -1)
ixs = np.concatenate([col[:,None], t], 1)
# translate indices to row indices
rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
c = np.empty((len(col), 5), dtype=x.dtype)
c[:,::2] = x[rows_drop[:,None], ixs]
c[:,1::2] = x[rows_replace[:,None], [2,4]]
# update dataframe and drop rows
df.iloc[rows_replace, 1:] = c
return df.drop(rows_drop)
Which fo the proposed dataframe yields the expected output:
print(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
1 DEF XYZ 456 893 9999 9999
2 AAA CCC 473 341 5555 5555
3 BBB CCC 917 341 5555 5555
remove_middle_dest(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555
This approach does not assume any particular order in terms of the rows where the duplicate is, but neither the column on which it is (to cover the edge case described in the question). If we use for instance the following dataframe:
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555
remove_middle_dest(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555
edited 2 hours ago
answered 8 hours ago
yatuyatu
22.6k42149
22.6k42149
Would this generalize to the case where the flights are randomly distributed over the dataframe?
– Erfan
8 hours ago
Yes,return_indexfromnp.uniquegives the index where a value is repeated. Regardless of its position. @erfan
– yatu
8 hours ago
I think the only problem is//2
– WeNYoBen
7 hours ago
Thanks for pointing out @WeNYoBen, updated the answer!
– yatu
6 hours ago
@yatu I updated the question with a potential edge case.
– MaxB
4 hours ago
|
show 1 more comment
Would this generalize to the case where the flights are randomly distributed over the dataframe?
– Erfan
8 hours ago
Yes,return_indexfromnp.uniquegives the index where a value is repeated. Regardless of its position. @erfan
– yatu
8 hours ago
I think the only problem is//2
– WeNYoBen
7 hours ago
Thanks for pointing out @WeNYoBen, updated the answer!
– yatu
6 hours ago
@yatu I updated the question with a potential edge case.
– MaxB
4 hours ago
Would this generalize to the case where the flights are randomly distributed over the dataframe?
– Erfan
8 hours ago
Would this generalize to the case where the flights are randomly distributed over the dataframe?
– Erfan
8 hours ago
Yes,
return_index from np.unique gives the index where a value is repeated. Regardless of its position. @erfan– yatu
8 hours ago
Yes,
return_index from np.unique gives the index where a value is repeated. Regardless of its position. @erfan– yatu
8 hours ago
I think the only problem is
//2– WeNYoBen
7 hours ago
I think the only problem is
//2– WeNYoBen
7 hours ago
Thanks for pointing out @WeNYoBen, updated the answer!
– yatu
6 hours ago
Thanks for pointing out @WeNYoBen, updated the answer!
– yatu
6 hours ago
@yatu I updated the question with a potential edge case.
– MaxB
4 hours ago
@yatu I updated the question with a potential edge case.
– MaxB
4 hours ago
|
show 1 more comment
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f56344082%2ffastest-way-to-perform-complex-search-on-pandas-dataframe%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Are the connecting flights always adjacent in the data frame?
– Mike
8 hours ago
np.where(condition)
– Dadu Khan
8 hours ago
how about
df['flightFrom'].shift() != df['fightTo']?– IanS
8 hours ago
@Mike the information can be completely random in the DataFrame
– MaxB
8 hours ago
1
@IanS check the values in
fromNum, fromCodeexpected output, that's what makes this question complex imo.– Erfan
8 hours ago