Fastest way to perform complex search on pandas dataframeAdd one row to pandas DataFrameSelecting multiple columns in a pandas dataframeAdding new column to existing DataFrame in Python pandasDelete column from pandas DataFrameHow do I get the row count of a pandas DataFrame?How to iterate over rows in a DataFrame in Pandas?Writing a pandas DataFrame to CSV fileSelect rows from a DataFrame based on values in a column in pandasGet list from pandas DataFrame column headersConvert list of dictionaries to a pandas DataFrame

Preserving culinary oils

What are the slash markings on Gatwick's 08R/26L?

Why does the UK have more political parties than the US?

What does it mean when you think without speaking?

How can I prevent interns from being expendable?

Mother abusing my finances

How did early x86 BIOS programmers manage to program full blown TUIs given very few bytes of ROM/EPROM?

When a current flow in an inductor is interrupted, what limits the voltage rise?

Adding strings in lists together

Why teaching kids Torah is the only forbidden profession for singles Yihud-wise?

What are the problems in teaching guitar via Skype?

If a massive object like Jupiter flew past the Earth how close would it need to come to pull people off of the surface?

Uncommanded roll at high speed

Are UK pensions taxed twice?

Looking after a wayward brother in mother's will

Select row of data if next row contains zero

Lunar orbital rendezvous

What is the difference between nullifying your vote and not going to vote at all?

Different PCB color ( is it different material? )

Improve OR inside INNER JOIN

Why do Russians call their women expensive ("дорогая")?

Could I be denied entry into Ireland due to medical and police situations during a previous UK visit?

Mapping a function f[xi_,xj_] over a list x1, ...., xn with the i < j restriction

Can I install a row of bricks on a slab to support a shed?



Fastest way to perform complex search on pandas dataframe


Add one row to pandas DataFrameSelecting multiple columns in a pandas dataframeAdding new column to existing DataFrame in Python pandasDelete column from pandas DataFrameHow do I get the row count of a pandas DataFrame?How to iterate over rows in a DataFrame in Pandas?Writing a pandas DataFrame to CSV fileSelect rows from a DataFrame based on values in a column in pandasGet list from pandas DataFrame column headersConvert list of dictionaries to a pandas DataFrame






.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty height:90px;width:728px;box-sizing:border-box;








12















I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.



Before:



flightTo flightFrom toNum fromNum toCode fromCode
ABC DEF 123 456 8000 8000
DEF XYZ 456 893 9999 9999
AAA BBB 473 917 5555 5555
BBB CCC 917 341 5555 5555


After search/sort:



flightTo flightFrom toNum fromNum toCode fromCode
ABC XYZ 123 893 8000 9999
AAA CCC 473 341 5555 5555


In this example I am essentially trying to filter out 'flights' that exist in between end destinations. This should be done by using some sort of drop duplicates method but what leaves me confused is how to handle all of the columns. Would a binary search be the best way to accomplish this? Hints appreciated, trying hard to figure this out.



possible edge case:



What if the data is switched up and our end connections are in the same column?



flight1 flight2 1Num 2Num 1Code 2Code
ABC DEF 123 456 8000 8000
XYZ DEF 893 456 9999 9999


After search/sort:



flight1 flight2 1Num 2Num 1Code 2Code
ABC XYZ 123 893 8000 9999


This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ










share|improve this question
























  • Are the connecting flights always adjacent in the data frame?

    – Mike
    8 hours ago











  • np.where(condition)

    – Dadu Khan
    8 hours ago











  • how about df['flightFrom'].shift() != df['fightTo']?

    – IanS
    8 hours ago












  • @Mike the information can be completely random in the DataFrame

    – MaxB
    8 hours ago






  • 1





    @IanS check the values in fromNum, fromCode expected output, that's what makes this question complex imo.

    – Erfan
    8 hours ago


















12















I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.



Before:



flightTo flightFrom toNum fromNum toCode fromCode
ABC DEF 123 456 8000 8000
DEF XYZ 456 893 9999 9999
AAA BBB 473 917 5555 5555
BBB CCC 917 341 5555 5555


After search/sort:



flightTo flightFrom toNum fromNum toCode fromCode
ABC XYZ 123 893 8000 9999
AAA CCC 473 341 5555 5555


In this example I am essentially trying to filter out 'flights' that exist in between end destinations. This should be done by using some sort of drop duplicates method but what leaves me confused is how to handle all of the columns. Would a binary search be the best way to accomplish this? Hints appreciated, trying hard to figure this out.



possible edge case:



What if the data is switched up and our end connections are in the same column?



flight1 flight2 1Num 2Num 1Code 2Code
ABC DEF 123 456 8000 8000
XYZ DEF 893 456 9999 9999


After search/sort:



flight1 flight2 1Num 2Num 1Code 2Code
ABC XYZ 123 893 8000 9999


This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ










share|improve this question
























  • Are the connecting flights always adjacent in the data frame?

    – Mike
    8 hours ago











  • np.where(condition)

    – Dadu Khan
    8 hours ago











  • how about df['flightFrom'].shift() != df['fightTo']?

    – IanS
    8 hours ago












  • @Mike the information can be completely random in the DataFrame

    – MaxB
    8 hours ago






  • 1





    @IanS check the values in fromNum, fromCode expected output, that's what makes this question complex imo.

    – Erfan
    8 hours ago














12












12








12


7






I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.



Before:



flightTo flightFrom toNum fromNum toCode fromCode
ABC DEF 123 456 8000 8000
DEF XYZ 456 893 9999 9999
AAA BBB 473 917 5555 5555
BBB CCC 917 341 5555 5555


After search/sort:



flightTo flightFrom toNum fromNum toCode fromCode
ABC XYZ 123 893 8000 9999
AAA CCC 473 341 5555 5555


In this example I am essentially trying to filter out 'flights' that exist in between end destinations. This should be done by using some sort of drop duplicates method but what leaves me confused is how to handle all of the columns. Would a binary search be the best way to accomplish this? Hints appreciated, trying hard to figure this out.



possible edge case:



What if the data is switched up and our end connections are in the same column?



flight1 flight2 1Num 2Num 1Code 2Code
ABC DEF 123 456 8000 8000
XYZ DEF 893 456 9999 9999


After search/sort:



flight1 flight2 1Num 2Num 1Code 2Code
ABC XYZ 123 893 8000 9999


This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ










share|improve this question
















I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.



Before:



flightTo flightFrom toNum fromNum toCode fromCode
ABC DEF 123 456 8000 8000
DEF XYZ 456 893 9999 9999
AAA BBB 473 917 5555 5555
BBB CCC 917 341 5555 5555


After search/sort:



flightTo flightFrom toNum fromNum toCode fromCode
ABC XYZ 123 893 8000 9999
AAA CCC 473 341 5555 5555


In this example I am essentially trying to filter out 'flights' that exist in between end destinations. This should be done by using some sort of drop duplicates method but what leaves me confused is how to handle all of the columns. Would a binary search be the best way to accomplish this? Hints appreciated, trying hard to figure this out.



possible edge case:



What if the data is switched up and our end connections are in the same column?



flight1 flight2 1Num 2Num 1Code 2Code
ABC DEF 123 456 8000 8000
XYZ DEF 893 456 9999 9999


After search/sort:



flight1 flight2 1Num 2Num 1Code 2Code
ABC XYZ 123 893 8000 9999


This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ







python pandas binary-search-tree






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited 4 hours ago







MaxB

















asked 8 hours ago









MaxBMaxB

1739




1739












  • Are the connecting flights always adjacent in the data frame?

    – Mike
    8 hours ago











  • np.where(condition)

    – Dadu Khan
    8 hours ago











  • how about df['flightFrom'].shift() != df['fightTo']?

    – IanS
    8 hours ago












  • @Mike the information can be completely random in the DataFrame

    – MaxB
    8 hours ago






  • 1





    @IanS check the values in fromNum, fromCode expected output, that's what makes this question complex imo.

    – Erfan
    8 hours ago


















  • Are the connecting flights always adjacent in the data frame?

    – Mike
    8 hours ago











  • np.where(condition)

    – Dadu Khan
    8 hours ago











  • how about df['flightFrom'].shift() != df['fightTo']?

    – IanS
    8 hours ago












  • @Mike the information can be completely random in the DataFrame

    – MaxB
    8 hours ago






  • 1





    @IanS check the values in fromNum, fromCode expected output, that's what makes this question complex imo.

    – Erfan
    8 hours ago

















Are the connecting flights always adjacent in the data frame?

– Mike
8 hours ago





Are the connecting flights always adjacent in the data frame?

– Mike
8 hours ago













np.where(condition)

– Dadu Khan
8 hours ago





np.where(condition)

– Dadu Khan
8 hours ago













how about df['flightFrom'].shift() != df['fightTo']?

– IanS
8 hours ago






how about df['flightFrom'].shift() != df['fightTo']?

– IanS
8 hours ago














@Mike the information can be completely random in the DataFrame

– MaxB
8 hours ago





@Mike the information can be completely random in the DataFrame

– MaxB
8 hours ago




1




1





@IanS check the values in fromNum, fromCode expected output, that's what makes this question complex imo.

– Erfan
8 hours ago






@IanS check the values in fromNum, fromCode expected output, that's what makes this question complex imo.

– Erfan
8 hours ago













2 Answers
2






active

oldest

votes


















11














This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC



import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

# create the nx object from pandas dataframe

l=list(nx.connected_components(G))

# then we get the list of components which as tied to each other ,
# in a net work graph , they are linked
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]

# then from the above we can create our map dict ,
# since every components connected to each other ,
# then we just need to pick of of them as key , then map with others

d=k: v for d in L for k, v in d.items()

# create the dict for groupby , since we need _from as first item and _to as last item
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output

Out[22]:
flightTo flightFrom toNum fromNum toCode fromCode
flightTo
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555


Installation networkx




  • Pip: pip install networkx


  • Anaconda: conda install -c anaconda networkx





share|improve this answer




















  • 2





    great answer! Looked into networkx couple times, will do more now!

    – Erfan
    8 hours ago






  • 2





    @Erfan love the enthusiasm ;) same here(for networkx)

    – anky_91
    8 hours ago






  • 2





    This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)

    – Erfan
    8 hours ago







  • 1





    @Erfan ok let me working on it

    – WeNYoBen
    8 hours ago






  • 1





    Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology

    – Prayson W. Daniel
    7 hours ago


















6














Here's a NumPy solution, which might be convenient in the case performance is relevant:



def remove_middle_dest(df):
x = df.to_numpy()
# obtain a flat numpy array from both columns
b = x[:,0:2].ravel()
_, ix, inv = np.unique(b, return_index=True, return_inverse=True)
# Index of duplicate values in b
ixs_drop = np.setdiff1d(np.arange(len(b)), ix)
# Indices to be used to replace the content in the columns
replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0)
# Col index of where duplicate value is, 0 or 1
col = (ixs_drop % 2) ^ 1
# 2d array to index and replace values in the df
# index to obtain values with which to replace
t = np.tile([3,5], len(col)).reshape(len(col), -1)
ixs = np.concatenate([col[:,None], t], 1)
# translate indices to row indices
rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
c = np.empty((len(col), 5), dtype=x.dtype)
c[:,::2] = x[rows_drop[:,None], ixs]
c[:,1::2] = x[rows_replace[:,None], [2,4]]
# update dataframe and drop rows
df.iloc[rows_replace, 1:] = c
return df.drop(rows_drop)



Which fo the proposed dataframe yields the expected output:



print(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
1 DEF XYZ 456 893 9999 9999
2 AAA CCC 473 341 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555


This approach does not assume any particular order in terms of the rows where the duplicate is, but neither the column on which it is (to cover the edge case described in the question). If we use for instance the following dataframe:



 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555





share|improve this answer

























  • Would this generalize to the case where the flights are randomly distributed over the dataframe?

    – Erfan
    8 hours ago











  • Yes, return_index from np.unique gives the index where a value is repeated. Regardless of its position. @erfan

    – yatu
    8 hours ago












  • I think the only problem is //2

    – WeNYoBen
    7 hours ago











  • Thanks for pointing out @WeNYoBen, updated the answer!

    – yatu
    6 hours ago












  • @yatu I updated the question with a potential edge case.

    – MaxB
    4 hours ago











Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f56344082%2ffastest-way-to-perform-complex-search-on-pandas-dataframe%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









11














This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC



import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

# create the nx object from pandas dataframe

l=list(nx.connected_components(G))

# then we get the list of components which as tied to each other ,
# in a net work graph , they are linked
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]

# then from the above we can create our map dict ,
# since every components connected to each other ,
# then we just need to pick of of them as key , then map with others

d=k: v for d in L for k, v in d.items()

# create the dict for groupby , since we need _from as first item and _to as last item
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output

Out[22]:
flightTo flightFrom toNum fromNum toCode fromCode
flightTo
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555


Installation networkx




  • Pip: pip install networkx


  • Anaconda: conda install -c anaconda networkx





share|improve this answer




















  • 2





    great answer! Looked into networkx couple times, will do more now!

    – Erfan
    8 hours ago






  • 2





    @Erfan love the enthusiasm ;) same here(for networkx)

    – anky_91
    8 hours ago






  • 2





    This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)

    – Erfan
    8 hours ago







  • 1





    @Erfan ok let me working on it

    – WeNYoBen
    8 hours ago






  • 1





    Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology

    – Prayson W. Daniel
    7 hours ago















11














This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC



import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

# create the nx object from pandas dataframe

l=list(nx.connected_components(G))

# then we get the list of components which as tied to each other ,
# in a net work graph , they are linked
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]

# then from the above we can create our map dict ,
# since every components connected to each other ,
# then we just need to pick of of them as key , then map with others

d=k: v for d in L for k, v in d.items()

# create the dict for groupby , since we need _from as first item and _to as last item
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output

Out[22]:
flightTo flightFrom toNum fromNum toCode fromCode
flightTo
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555


Installation networkx




  • Pip: pip install networkx


  • Anaconda: conda install -c anaconda networkx





share|improve this answer




















  • 2





    great answer! Looked into networkx couple times, will do more now!

    – Erfan
    8 hours ago






  • 2





    @Erfan love the enthusiasm ;) same here(for networkx)

    – anky_91
    8 hours ago






  • 2





    This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)

    – Erfan
    8 hours ago







  • 1





    @Erfan ok let me working on it

    – WeNYoBen
    8 hours ago






  • 1





    Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology

    – Prayson W. Daniel
    7 hours ago













11












11








11







This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC



import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

# create the nx object from pandas dataframe

l=list(nx.connected_components(G))

# then we get the list of components which as tied to each other ,
# in a net work graph , they are linked
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]

# then from the above we can create our map dict ,
# since every components connected to each other ,
# then we just need to pick of of them as key , then map with others

d=k: v for d in L for k, v in d.items()

# create the dict for groupby , since we need _from as first item and _to as last item
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output

Out[22]:
flightTo flightFrom toNum fromNum toCode fromCode
flightTo
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555


Installation networkx




  • Pip: pip install networkx


  • Anaconda: conda install -c anaconda networkx





share|improve this answer















This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC



import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

# create the nx object from pandas dataframe

l=list(nx.connected_components(G))

# then we get the list of components which as tied to each other ,
# in a net work graph , they are linked
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]

# then from the above we can create our map dict ,
# since every components connected to each other ,
# then we just need to pick of of them as key , then map with others

d=k: v for d in L for k, v in d.items()

# create the dict for groupby , since we need _from as first item and _to as last item
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output

Out[22]:
flightTo flightFrom toNum fromNum toCode fromCode
flightTo
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555


Installation networkx




  • Pip: pip install networkx


  • Anaconda: conda install -c anaconda networkx






share|improve this answer














share|improve this answer



share|improve this answer








edited 4 hours ago

























answered 8 hours ago









WeNYoBenWeNYoBen

136k84575




136k84575







  • 2





    great answer! Looked into networkx couple times, will do more now!

    – Erfan
    8 hours ago






  • 2





    @Erfan love the enthusiasm ;) same here(for networkx)

    – anky_91
    8 hours ago






  • 2





    This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)

    – Erfan
    8 hours ago







  • 1





    @Erfan ok let me working on it

    – WeNYoBen
    8 hours ago






  • 1





    Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology

    – Prayson W. Daniel
    7 hours ago












  • 2





    great answer! Looked into networkx couple times, will do more now!

    – Erfan
    8 hours ago






  • 2





    @Erfan love the enthusiasm ;) same here(for networkx)

    – anky_91
    8 hours ago






  • 2





    This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)

    – Erfan
    8 hours ago







  • 1





    @Erfan ok let me working on it

    – WeNYoBen
    8 hours ago






  • 1





    Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology

    – Prayson W. Daniel
    7 hours ago







2




2





great answer! Looked into networkx couple times, will do more now!

– Erfan
8 hours ago





great answer! Looked into networkx couple times, will do more now!

– Erfan
8 hours ago




2




2





@Erfan love the enthusiasm ;) same here(for networkx)

– anky_91
8 hours ago





@Erfan love the enthusiasm ;) same here(for networkx)

– anky_91
8 hours ago




2




2





This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)

– Erfan
8 hours ago






This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)

– Erfan
8 hours ago





1




1





@Erfan ok let me working on it

– WeNYoBen
8 hours ago





@Erfan ok let me working on it

– WeNYoBen
8 hours ago




1




1





Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology

– Prayson W. Daniel
7 hours ago





Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology

– Prayson W. Daniel
7 hours ago













6














Here's a NumPy solution, which might be convenient in the case performance is relevant:



def remove_middle_dest(df):
x = df.to_numpy()
# obtain a flat numpy array from both columns
b = x[:,0:2].ravel()
_, ix, inv = np.unique(b, return_index=True, return_inverse=True)
# Index of duplicate values in b
ixs_drop = np.setdiff1d(np.arange(len(b)), ix)
# Indices to be used to replace the content in the columns
replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0)
# Col index of where duplicate value is, 0 or 1
col = (ixs_drop % 2) ^ 1
# 2d array to index and replace values in the df
# index to obtain values with which to replace
t = np.tile([3,5], len(col)).reshape(len(col), -1)
ixs = np.concatenate([col[:,None], t], 1)
# translate indices to row indices
rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
c = np.empty((len(col), 5), dtype=x.dtype)
c[:,::2] = x[rows_drop[:,None], ixs]
c[:,1::2] = x[rows_replace[:,None], [2,4]]
# update dataframe and drop rows
df.iloc[rows_replace, 1:] = c
return df.drop(rows_drop)



Which fo the proposed dataframe yields the expected output:



print(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
1 DEF XYZ 456 893 9999 9999
2 AAA CCC 473 341 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555


This approach does not assume any particular order in terms of the rows where the duplicate is, but neither the column on which it is (to cover the edge case described in the question). If we use for instance the following dataframe:



 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555





share|improve this answer

























  • Would this generalize to the case where the flights are randomly distributed over the dataframe?

    – Erfan
    8 hours ago











  • Yes, return_index from np.unique gives the index where a value is repeated. Regardless of its position. @erfan

    – yatu
    8 hours ago












  • I think the only problem is //2

    – WeNYoBen
    7 hours ago











  • Thanks for pointing out @WeNYoBen, updated the answer!

    – yatu
    6 hours ago












  • @yatu I updated the question with a potential edge case.

    – MaxB
    4 hours ago















6














Here's a NumPy solution, which might be convenient in the case performance is relevant:



def remove_middle_dest(df):
x = df.to_numpy()
# obtain a flat numpy array from both columns
b = x[:,0:2].ravel()
_, ix, inv = np.unique(b, return_index=True, return_inverse=True)
# Index of duplicate values in b
ixs_drop = np.setdiff1d(np.arange(len(b)), ix)
# Indices to be used to replace the content in the columns
replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0)
# Col index of where duplicate value is, 0 or 1
col = (ixs_drop % 2) ^ 1
# 2d array to index and replace values in the df
# index to obtain values with which to replace
t = np.tile([3,5], len(col)).reshape(len(col), -1)
ixs = np.concatenate([col[:,None], t], 1)
# translate indices to row indices
rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
c = np.empty((len(col), 5), dtype=x.dtype)
c[:,::2] = x[rows_drop[:,None], ixs]
c[:,1::2] = x[rows_replace[:,None], [2,4]]
# update dataframe and drop rows
df.iloc[rows_replace, 1:] = c
return df.drop(rows_drop)



Which fo the proposed dataframe yields the expected output:



print(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
1 DEF XYZ 456 893 9999 9999
2 AAA CCC 473 341 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555


This approach does not assume any particular order in terms of the rows where the duplicate is, but neither the column on which it is (to cover the edge case described in the question). If we use for instance the following dataframe:



 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555





share|improve this answer

























  • Would this generalize to the case where the flights are randomly distributed over the dataframe?

    – Erfan
    8 hours ago











  • Yes, return_index from np.unique gives the index where a value is repeated. Regardless of its position. @erfan

    – yatu
    8 hours ago












  • I think the only problem is //2

    – WeNYoBen
    7 hours ago











  • Thanks for pointing out @WeNYoBen, updated the answer!

    – yatu
    6 hours ago












  • @yatu I updated the question with a potential edge case.

    – MaxB
    4 hours ago













6












6








6







Here's a NumPy solution, which might be convenient in the case performance is relevant:



def remove_middle_dest(df):
x = df.to_numpy()
# obtain a flat numpy array from both columns
b = x[:,0:2].ravel()
_, ix, inv = np.unique(b, return_index=True, return_inverse=True)
# Index of duplicate values in b
ixs_drop = np.setdiff1d(np.arange(len(b)), ix)
# Indices to be used to replace the content in the columns
replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0)
# Col index of where duplicate value is, 0 or 1
col = (ixs_drop % 2) ^ 1
# 2d array to index and replace values in the df
# index to obtain values with which to replace
t = np.tile([3,5], len(col)).reshape(len(col), -1)
ixs = np.concatenate([col[:,None], t], 1)
# translate indices to row indices
rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
c = np.empty((len(col), 5), dtype=x.dtype)
c[:,::2] = x[rows_drop[:,None], ixs]
c[:,1::2] = x[rows_replace[:,None], [2,4]]
# update dataframe and drop rows
df.iloc[rows_replace, 1:] = c
return df.drop(rows_drop)



Which fo the proposed dataframe yields the expected output:



print(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
1 DEF XYZ 456 893 9999 9999
2 AAA CCC 473 341 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555


This approach does not assume any particular order in terms of the rows where the duplicate is, but neither the column on which it is (to cover the edge case described in the question). If we use for instance the following dataframe:



 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555





share|improve this answer















Here's a NumPy solution, which might be convenient in the case performance is relevant:



def remove_middle_dest(df):
x = df.to_numpy()
# obtain a flat numpy array from both columns
b = x[:,0:2].ravel()
_, ix, inv = np.unique(b, return_index=True, return_inverse=True)
# Index of duplicate values in b
ixs_drop = np.setdiff1d(np.arange(len(b)), ix)
# Indices to be used to replace the content in the columns
replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0)
# Col index of where duplicate value is, 0 or 1
col = (ixs_drop % 2) ^ 1
# 2d array to index and replace values in the df
# index to obtain values with which to replace
t = np.tile([3,5], len(col)).reshape(len(col), -1)
ixs = np.concatenate([col[:,None], t], 1)
# translate indices to row indices
rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
c = np.empty((len(col), 5), dtype=x.dtype)
c[:,::2] = x[rows_drop[:,None], ixs]
c[:,1::2] = x[rows_replace[:,None], [2,4]]
# update dataframe and drop rows
df.iloc[rows_replace, 1:] = c
return df.drop(rows_drop)



Which fo the proposed dataframe yields the expected output:



print(df)
flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
1 DEF XYZ 456 893 9999 9999
2 AAA CCC 473 341 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555


This approach does not assume any particular order in terms of the rows where the duplicate is, but neither the column on which it is (to cover the edge case described in the question). If we use for instance the following dataframe:



 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555






share|improve this answer














share|improve this answer



share|improve this answer








edited 2 hours ago

























answered 8 hours ago









yatuyatu

22.6k42149




22.6k42149












  • Would this generalize to the case where the flights are randomly distributed over the dataframe?

    – Erfan
    8 hours ago











  • Yes, return_index from np.unique gives the index where a value is repeated. Regardless of its position. @erfan

    – yatu
    8 hours ago












  • I think the only problem is //2

    – WeNYoBen
    7 hours ago











  • Thanks for pointing out @WeNYoBen, updated the answer!

    – yatu
    6 hours ago












  • @yatu I updated the question with a potential edge case.

    – MaxB
    4 hours ago

















  • Would this generalize to the case where the flights are randomly distributed over the dataframe?

    – Erfan
    8 hours ago











  • Yes, return_index from np.unique gives the index where a value is repeated. Regardless of its position. @erfan

    – yatu
    8 hours ago












  • I think the only problem is //2

    – WeNYoBen
    7 hours ago











  • Thanks for pointing out @WeNYoBen, updated the answer!

    – yatu
    6 hours ago












  • @yatu I updated the question with a potential edge case.

    – MaxB
    4 hours ago
















Would this generalize to the case where the flights are randomly distributed over the dataframe?

– Erfan
8 hours ago





Would this generalize to the case where the flights are randomly distributed over the dataframe?

– Erfan
8 hours ago













Yes, return_index from np.unique gives the index where a value is repeated. Regardless of its position. @erfan

– yatu
8 hours ago






Yes, return_index from np.unique gives the index where a value is repeated. Regardless of its position. @erfan

– yatu
8 hours ago














I think the only problem is //2

– WeNYoBen
7 hours ago





I think the only problem is //2

– WeNYoBen
7 hours ago













Thanks for pointing out @WeNYoBen, updated the answer!

– yatu
6 hours ago






Thanks for pointing out @WeNYoBen, updated the answer!

– yatu
6 hours ago














@yatu I updated the question with a potential edge case.

– MaxB
4 hours ago





@yatu I updated the question with a potential edge case.

– MaxB
4 hours ago

















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f56344082%2ffastest-way-to-perform-complex-search-on-pandas-dataframe%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Canceling a color specificationRandomly assigning color to Graphics3D objects?Default color for Filling in Mathematica 9Coloring specific elements of sets with a prime modified order in an array plotHow to pick a color differing significantly from the colors already in a given color list?Detection of the text colorColor numbers based on their valueCan color schemes for use with ColorData include opacity specification?My dynamic color schemes

Invision Community Contents History See also References External links Navigation menuProprietaryinvisioncommunity.comIPS Community ForumsIPS Community Forumsthis blog entry"License Changes, IP.Board 3.4, and the Future""Interview -- Matt Mecham of Ibforums""CEO Invision Power Board, Matt Mecham Is a Liar, Thief!"IPB License Explanation 1.3, 1.3.1, 2.0, and 2.1ArchivedSecurity Fixes, Updates And Enhancements For IPB 1.3.1Archived"New Demo Accounts - Invision Power Services"the original"New Default Skin"the original"Invision Power Board 3.0.0 and Applications Released"the original"Archived copy"the original"Perpetual licenses being done away with""Release Notes - Invision Power Services""Introducing: IPS Community Suite 4!"Invision Community Release Notes

François Viète Contents Biography Work and thought Bibliography See also Notes Further reading External links Navigation menup. 21Google Bookspp. 75–77Google BooksDe thou (from University of Saint Andrews)ArchivedGoogle BooksGoogle BooksGoogle BooksGoogle booksGoogle Bookscc-parthenay.frL'histoire universelle (fr)Universal History (en)ArchivedAdsabs.harvard.eduPagesperso-orange.frArchive.orgChikara Sasaki. Descartes' mathematical thought p.259Google BooksGoogle BooksGoogle Bookspp. 152 and onwardGoogle BooksGoogle BooksScribd.comGoogle Books1257-7979Google BooksGoogle BooksGoogle BooksGoogle BooksGoogle BooksGoogle BooksGallica.bnf.frGoogle BooksGoogle Books"François Viète"Francois Viète: Father of Modern Algebraic NotationThe Lawyer and the GamblerAbout TarporleySite de Jean-Paul GuichardL'algèbre nouvelle"About the Harmonicon"cb120511976(data)1188044800000 0001 0913 5903n82164680ola2013766880073431702w6vt1sb70287374827140948071409480