Fastest way to perform complex search on pandas dataframeAdd one row to pandas DataFrameSelecting multiple columns in a pandas dataframeAdding new column to existing DataFrame in Python pandasDelete column from pandas DataFrameHow do I get the row count of a pandas DataFrame?How to iterate over rows in a DataFrame in Pandas?Writing a pandas DataFrame to CSV fileSelect rows from a DataFrame based on values in a column in pandasGet list from pandas DataFrame column headersConvert list of dictionaries to a pandas DataFrame

Preserving culinary oils

What are the slash markings on Gatwick's 08R/26L?

Why does the UK have more political parties than the US?

What does it mean when you think without speaking?

How can I prevent interns from being expendable?

Mother abusing my finances

How did early x86 BIOS programmers manage to program full blown TUIs given very few bytes of ROM/EPROM?

When a current flow in an inductor is interrupted, what limits the voltage rise?

Adding strings in lists together

Why teaching kids Torah is the only forbidden profession for singles Yihud-wise?

What are the problems in teaching guitar via Skype?

If a massive object like Jupiter flew past the Earth how close would it need to come to pull people off of the surface?

Uncommanded roll at high speed

Are UK pensions taxed twice?

Looking after a wayward brother in mother's will

Select row of data if next row contains zero

Lunar orbital rendezvous

What is the difference between nullifying your vote and not going to vote at all?

Different PCB color ( is it different material? )

Improve OR inside INNER JOIN

Why do Russians call their women expensive ("дорогая")?

Could I be denied entry into Ireland due to medical and police situations during a previous UK visit?

Mapping a function f[xi_,xj_] over a list x1, ...., xn with the i < j restriction

Can I install a row of bricks on a slab to support a shed?

Fastest way to perform complex search on pandas dataframe

Add one row to pandas DataFrameSelecting multiple columns in a pandas dataframeAdding new column to existing DataFrame in Python pandasDelete column from pandas DataFrameHow do I get the row count of a pandas DataFrame?How to iterate over rows in a DataFrame in Pandas?Writing a pandas DataFrame to CSV fileSelect rows from a DataFrame based on values in a column in pandasGet list from pandas DataFrame column headersConvert list of dictionaries to a pandas DataFrame

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty height:90px;width:728px;box-sizing:border-box;

I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.

Before:

flightTo flightFrom toNum fromNum toCode fromCode
 ABC DEF 123 456 8000 8000
 DEF XYZ 456 893 9999 9999
 AAA BBB 473 917 5555 5555
 BBB CCC 917 341 5555 5555

After search/sort:

flightTo flightFrom toNum fromNum toCode fromCode
 ABC XYZ 123 893 8000 9999
 AAA CCC 473 341 5555 5555

In this example I am essentially trying to filter out 'flights' that exist in between end destinations. This should be done by using some sort of drop duplicates method but what leaves me confused is how to handle all of the columns. Would a binary search be the best way to accomplish this? Hints appreciated, trying hard to figure this out.

possible edge case:

What if the data is switched up and our end connections are in the same column?

flight1 flight2 1Num 2Num 1Code 2Code
 ABC DEF 123 456 8000 8000
 XYZ DEF 893 456 9999 9999

After search/sort:

flight1 flight2 1Num 2Num 1Code 2Code
 ABC XYZ 123 893 8000 9999

This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ

edited 4 hours ago

asked 8 hours ago

MaxB

1739

Are the connecting flights always adjacent in the data frame?

– Mike
8 hours ago

np.where(condition)

– Dadu Khan
8 hours ago

how about df['flightFrom'].shift() != df['fightTo']?

– IanS
8 hours ago

@Mike the information can be completely random in the DataFrame

– MaxB
8 hours ago

1

@IanS check the values in fromNum, fromCode expected output, that's what makes this question complex imo.

– Erfan
8 hours ago

add a comment |

I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.

Before:

flightTo flightFrom toNum fromNum toCode fromCode
 ABC DEF 123 456 8000 8000
 DEF XYZ 456 893 9999 9999
 AAA BBB 473 917 5555 5555
 BBB CCC 917 341 5555 5555

After search/sort:

flightTo flightFrom toNum fromNum toCode fromCode
 ABC XYZ 123 893 8000 9999
 AAA CCC 473 341 5555 5555

possible edge case:

What if the data is switched up and our end connections are in the same column?

flight1 flight2 1Num 2Num 1Code 2Code
 ABC DEF 123 456 8000 8000
 XYZ DEF 893 456 9999 9999

After search/sort:

flight1 flight2 1Num 2Num 1Code 2Code
 ABC XYZ 123 893 8000 9999

This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ

edited 4 hours ago

asked 8 hours ago

MaxB

1739

Are the connecting flights always adjacent in the data frame?

– Mike
8 hours ago

np.where(condition)

– Dadu Khan
8 hours ago

how about df['flightFrom'].shift() != df['fightTo']?

– IanS
8 hours ago

@Mike the information can be completely random in the DataFrame

– MaxB
8 hours ago

1

@IanS check the values in fromNum, fromCode expected output, that's what makes this question complex imo.

– Erfan
8 hours ago

add a comment |

I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.

Before:

flightTo flightFrom toNum fromNum toCode fromCode
 ABC DEF 123 456 8000 8000
 DEF XYZ 456 893 9999 9999
 AAA BBB 473 917 5555 5555
 BBB CCC 917 341 5555 5555

After search/sort:

flightTo flightFrom toNum fromNum toCode fromCode
 ABC XYZ 123 893 8000 9999
 AAA CCC 473 341 5555 5555

possible edge case:

What if the data is switched up and our end connections are in the same column?

flight1 flight2 1Num 2Num 1Code 2Code
 ABC DEF 123 456 8000 8000
 XYZ DEF 893 456 9999 9999

After search/sort:

flight1 flight2 1Num 2Num 1Code 2Code
 ABC XYZ 123 893 8000 9999

This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ

edited 4 hours ago

asked 8 hours ago

MaxB

1739

I am trying to figure out the fastest way to perform search and sort on a pandas dataframe. Below are before and after dataframes of what I am trying to accomplish.

Before:

flightTo flightFrom toNum fromNum toCode fromCode
 ABC DEF 123 456 8000 8000
 DEF XYZ 456 893 9999 9999
 AAA BBB 473 917 5555 5555
 BBB CCC 917 341 5555 5555

After search/sort:

flightTo flightFrom toNum fromNum toCode fromCode
 ABC XYZ 123 893 8000 9999
 AAA CCC 473 341 5555 5555

possible edge case:

What if the data is switched up and our end connections are in the same column?

flight1 flight2 1Num 2Num 1Code 2Code
 ABC DEF 123 456 8000 8000
 XYZ DEF 893 456 9999 9999

After search/sort:

flight1 flight2 1Num 2Num 1Code 2Code
 ABC XYZ 123 893 8000 9999

This case logically shouldn't happen. After all how can you go DEF-ABC and DEF-XYZ? You can't, but the 'endpoints' would still be ABC-XYZ

python pandas binary-search-tree

edited 4 hours ago

asked 8 hours ago

MaxB

1739

edited 4 hours ago

asked 8 hours ago

MaxB

1739

edited 4 hours ago

asked 8 hours ago

MaxB

1739

asked 8 hours ago

MaxB

1739

asked 8 hours ago

MaxB

1739

Are the connecting flights always adjacent in the data frame?

– Mike
8 hours ago

np.where(condition)

– Dadu Khan
8 hours ago

how about df['flightFrom'].shift() != df['fightTo']?

– IanS
8 hours ago

@Mike the information can be completely random in the DataFrame

– MaxB
8 hours ago

1

@IanS check the values in fromNum, fromCode expected output, that's what makes this question complex imo.

– Erfan
8 hours ago

add a comment |

Are the connecting flights always adjacent in the data frame?

– Mike
8 hours ago

np.where(condition)

– Dadu Khan
8 hours ago

how about df['flightFrom'].shift() != df['fightTo']?

– IanS
8 hours ago

@Mike the information can be completely random in the DataFrame

– MaxB
8 hours ago

1

@IanS check the values in fromNum, fromCode expected output, that's what makes this question complex imo.

– Erfan
8 hours ago

Are the connecting flights always adjacent in the data frame?

– Mike
8 hours ago

np.where(condition)

– Dadu Khan
8 hours ago

how about df['flightFrom'].shift() != df['fightTo']?

– IanS
8 hours ago

@Mike the information can be completely random in the DataFrame

– MaxB
8 hours ago

@IanS check the values in fromNum, fromCode expected output, that's what makes this question complex imo.

– Erfan
8 hours ago

add a comment |

2 Answers
2

active

oldest

votes

This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC

import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

# create the nx object from pandas dataframe

l=list(nx.connected_components(G))

# then we get the list of components which as tied to each other , 
# in a net work graph , they are linked 
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]

# then from the above we can create our map dict , 
# since every components connected to each other , 
# then we just need to pick of of them as key , then map with others

d=k: v for d in L for k, v in d.items()

# create the dict for groupby , since we need _from as first item and _to as last item 
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output 

Out[22]: 
 flightTo flightFrom toNum fromNum toCode fromCode
flightTo 
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555

Installation networkx

Pip: pip install networkx

Anaconda: conda install -c anaconda networkx

edited 4 hours ago

answered 8 hours ago

WeNYoBen

136k84575

2

great answer! Looked into networkx couple times, will do more now!

– Erfan
8 hours ago

2

@Erfan love the enthusiasm ;) same here(for networkx)

– anky_91
8 hours ago

2

This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)

– Erfan
8 hours ago

1

@Erfan ok let me working on it

– WeNYoBen
8 hours ago

1

Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology

– Prayson W. Daniel
7 hours ago

|
show 5 more comments

Here's a NumPy solution, which might be convenient in the case performance is relevant:

def remove_middle_dest(df):
 x = df.to_numpy()
 # obtain a flat numpy array from both columns
 b = x[:,0:2].ravel()
 _, ix, inv = np.unique(b, return_index=True, return_inverse=True)
 # Index of duplicate values in b
 ixs_drop = np.setdiff1d(np.arange(len(b)), ix) 
 # Indices to be used to replace the content in the columns
 replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0) 
 # Col index of where duplicate value is, 0 or 1
 col = (ixs_drop % 2) ^ 1
 # 2d array to index and replace values in the df
 # index to obtain values with which to replace
 t = np.tile([3,5], len(col)).reshape(len(col), -1)
 ixs = np.concatenate([col[:,None], t], 1)
 # translate indices to row indices
 rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
 c = np.empty((len(col), 5), dtype=x.dtype)
 c[:,::2] = x[rows_drop[:,None], ixs]
 c[:,1::2] = x[rows_replace[:,None], [2,4]]
 # update dataframe and drop rows
 df.iloc[rows_replace, 1:] = c
 return df.drop(rows_drop)

Which fo the proposed dataframe yields the expected output:

print(df)
 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
1 DEF XYZ 456 893 9999 9999
2 AAA CCC 473 341 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555

This approach does not assume any particular order in terms of the rows where the duplicate is, but neither the column on which it is (to cover the edge case described in the question). If we use for instance the following dataframe:

 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555

edited 2 hours ago

answered 8 hours ago

yatu

22.6k42149

Would this generalize to the case where the flights are randomly distributed over the dataframe?

– Erfan
8 hours ago

Yes, return_index from np.unique gives the index where a value is repeated. Regardless of its position. @erfan

– yatu
8 hours ago

I think the only problem is //2

– WeNYoBen
7 hours ago

Thanks for pointing out @WeNYoBen, updated the answer!

– yatu
6 hours ago

@yatu I updated the question with a potential edge case.

– MaxB
4 hours ago

|
show 1 more comment

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f56344082%2ffastest-way-to-perform-complex-search-on-pandas-dataframe%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC

import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

# create the nx object from pandas dataframe

l=list(nx.connected_components(G))

# then we get the list of components which as tied to each other , 
# in a net work graph , they are linked 
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]

# then from the above we can create our map dict , 
# since every components connected to each other , 
# then we just need to pick of of them as key , then map with others

d=k: v for d in L for k, v in d.items()

# create the dict for groupby , since we need _from as first item and _to as last item 
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output 

Out[22]: 
 flightTo flightFrom toNum fromNum toCode fromCode
flightTo 
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555

Installation networkx

Pip: pip install networkx

Anaconda: conda install -c anaconda networkx

edited 4 hours ago

answered 8 hours ago

WeNYoBen

136k84575

2

great answer! Looked into networkx couple times, will do more now!

– Erfan
8 hours ago

2

@Erfan love the enthusiasm ;) same here(for networkx)

– anky_91
8 hours ago

2

This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)

– Erfan
8 hours ago

1

@Erfan ok let me working on it

– WeNYoBen
8 hours ago

1

Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology

– Prayson W. Daniel
7 hours ago

|
show 5 more comments

This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC

import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

# create the nx object from pandas dataframe

l=list(nx.connected_components(G))

# then we get the list of components which as tied to each other , 
# in a net work graph , they are linked 
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]

# then from the above we can create our map dict , 
# since every components connected to each other , 
# then we just need to pick of of them as key , then map with others

d=k: v for d in L for k, v in d.items()

# create the dict for groupby , since we need _from as first item and _to as last item 
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output 

Out[22]: 
 flightTo flightFrom toNum fromNum toCode fromCode
flightTo 
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555

Installation networkx

Pip: pip install networkx

Anaconda: conda install -c anaconda networkx

edited 4 hours ago

answered 8 hours ago

WeNYoBen

136k84575

2

great answer! Looked into networkx couple times, will do more now!

– Erfan
8 hours ago

2

@Erfan love the enthusiasm ;) same here(for networkx)

– anky_91
8 hours ago

2

This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)

– Erfan
8 hours ago

1

@Erfan ok let me working on it

– WeNYoBen
8 hours ago

1

Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology

– Prayson W. Daniel
7 hours ago

|
show 5 more comments

This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC

import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

# create the nx object from pandas dataframe

l=list(nx.connected_components(G))

# then we get the list of components which as tied to each other , 
# in a net work graph , they are linked 
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]

# then from the above we can create our map dict , 
# since every components connected to each other , 
# then we just need to pick of of them as key , then map with others

d=k: v for d in L for k, v in d.items()

# create the dict for groupby , since we need _from as first item and _to as last item 
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output 

Out[22]: 
 flightTo flightFrom toNum fromNum toCode fromCode
flightTo 
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555

Installation networkx

Pip: pip install networkx

Anaconda: conda install -c anaconda networkx

edited 4 hours ago

answered 8 hours ago

WeNYoBen

136k84575

This is network problem , so we using networkx , notice , here you can have more than two stops , which means you can have some case like NY-DC-WA-NC

import networkx as nx
G=nx.from_pandas_edgelist(df, 'flightTo', 'flightFrom')

# create the nx object from pandas dataframe

l=list(nx.connected_components(G))

# then we get the list of components which as tied to each other , 
# in a net work graph , they are linked 
L=[dict.fromkeys(y,x) for x, y in enumerate(l)]

# then from the above we can create our map dict , 
# since every components connected to each other , 
# then we just need to pick of of them as key , then map with others

d=k: v for d in L for k, v in d.items()

# create the dict for groupby , since we need _from as first item and _to as last item 
grouppd=dict(zip(df.columns.tolist(),['first','last']*3))
df.groupby(df.flightTo.map(d)).agg(grouppd) # then using agg with dict yield your output 

Out[22]: 
 flightTo flightFrom toNum fromNum toCode fromCode
flightTo 
0 ABC XYZ 123 893 8000 9999
1 AAA CCC 473 341 5555 5555

Installation networkx

Pip: pip install networkx

Anaconda: conda install -c anaconda networkx

edited 4 hours ago

answered 8 hours ago

WeNYoBen

136k84575

edited 4 hours ago

answered 8 hours ago

WeNYoBen

136k84575

answered 8 hours ago

WeNYoBen

136k84575

answered 8 hours ago

WeNYoBen

136k84575

2

great answer! Looked into networkx couple times, will do more now!

– Erfan
8 hours ago

2

@Erfan love the enthusiasm ;) same here(for networkx)

– anky_91
8 hours ago

2

This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)

– Erfan
8 hours ago

1

@Erfan ok let me working on it

– WeNYoBen
8 hours ago

1

Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology

– Prayson W. Daniel
7 hours ago

|
show 5 more comments

2

great answer! Looked into networkx couple times, will do more now!

– Erfan
8 hours ago

2

@Erfan love the enthusiasm ;) same here(for networkx)

– anky_91
8 hours ago

2

This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)

– Erfan
8 hours ago

1

@Erfan ok let me working on it

– WeNYoBen
8 hours ago

1

Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology

– Prayson W. Daniel
7 hours ago

great answer! Looked into networkx couple times, will do more now!

– Erfan
8 hours ago

@Erfan love the enthusiasm ;) same here(for networkx)

– anky_91
8 hours ago

This answer deserves to be broken down in more explanation :) (so I can learn from it hehe)

– Erfan
8 hours ago

@Erfan ok let me working on it

– WeNYoBen
8 hours ago

Best answer I have read. Is it possible to edit variables, using information names, instead of letters, and expands the solution. Or best write a post/article on medium(or other place) explaining this methodology

– Prayson W. Daniel
7 hours ago

|
show 5 more comments

Here's a NumPy solution, which might be convenient in the case performance is relevant:

def remove_middle_dest(df):
 x = df.to_numpy()
 # obtain a flat numpy array from both columns
 b = x[:,0:2].ravel()
 _, ix, inv = np.unique(b, return_index=True, return_inverse=True)
 # Index of duplicate values in b
 ixs_drop = np.setdiff1d(np.arange(len(b)), ix) 
 # Indices to be used to replace the content in the columns
 replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0) 
 # Col index of where duplicate value is, 0 or 1
 col = (ixs_drop % 2) ^ 1
 # 2d array to index and replace values in the df
 # index to obtain values with which to replace
 t = np.tile([3,5], len(col)).reshape(len(col), -1)
 ixs = np.concatenate([col[:,None], t], 1)
 # translate indices to row indices
 rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
 c = np.empty((len(col), 5), dtype=x.dtype)
 c[:,::2] = x[rows_drop[:,None], ixs]
 c[:,1::2] = x[rows_replace[:,None], [2,4]]
 # update dataframe and drop rows
 df.iloc[rows_replace, 1:] = c
 return df.drop(rows_drop)

Which fo the proposed dataframe yields the expected output:

print(df)
 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
1 DEF XYZ 456 893 9999 9999
2 AAA CCC 473 341 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555

 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555

edited 2 hours ago

answered 8 hours ago

yatu

22.6k42149

Would this generalize to the case where the flights are randomly distributed over the dataframe?

– Erfan
8 hours ago

Yes, return_index from np.unique gives the index where a value is repeated. Regardless of its position. @erfan

– yatu
8 hours ago

I think the only problem is //2

– WeNYoBen
7 hours ago

Thanks for pointing out @WeNYoBen, updated the answer!

– yatu
6 hours ago

@yatu I updated the question with a potential edge case.

– MaxB
4 hours ago

|
show 1 more comment

Here's a NumPy solution, which might be convenient in the case performance is relevant:

def remove_middle_dest(df):
 x = df.to_numpy()
 # obtain a flat numpy array from both columns
 b = x[:,0:2].ravel()
 _, ix, inv = np.unique(b, return_index=True, return_inverse=True)
 # Index of duplicate values in b
 ixs_drop = np.setdiff1d(np.arange(len(b)), ix) 
 # Indices to be used to replace the content in the columns
 replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0) 
 # Col index of where duplicate value is, 0 or 1
 col = (ixs_drop % 2) ^ 1
 # 2d array to index and replace values in the df
 # index to obtain values with which to replace
 t = np.tile([3,5], len(col)).reshape(len(col), -1)
 ixs = np.concatenate([col[:,None], t], 1)
 # translate indices to row indices
 rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
 c = np.empty((len(col), 5), dtype=x.dtype)
 c[:,::2] = x[rows_drop[:,None], ixs]
 c[:,1::2] = x[rows_replace[:,None], [2,4]]
 # update dataframe and drop rows
 df.iloc[rows_replace, 1:] = c
 return df.drop(rows_drop)

Which fo the proposed dataframe yields the expected output:

print(df)
 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
1 DEF XYZ 456 893 9999 9999
2 AAA CCC 473 341 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555

 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555

edited 2 hours ago

answered 8 hours ago

yatu

22.6k42149

Would this generalize to the case where the flights are randomly distributed over the dataframe?

– Erfan
8 hours ago

Yes, return_index from np.unique gives the index where a value is repeated. Regardless of its position. @erfan

– yatu
8 hours ago

I think the only problem is //2

– WeNYoBen
7 hours ago

Thanks for pointing out @WeNYoBen, updated the answer!

– yatu
6 hours ago

@yatu I updated the question with a potential edge case.

– MaxB
4 hours ago

|
show 1 more comment

Here's a NumPy solution, which might be convenient in the case performance is relevant:

def remove_middle_dest(df):
 x = df.to_numpy()
 # obtain a flat numpy array from both columns
 b = x[:,0:2].ravel()
 _, ix, inv = np.unique(b, return_index=True, return_inverse=True)
 # Index of duplicate values in b
 ixs_drop = np.setdiff1d(np.arange(len(b)), ix) 
 # Indices to be used to replace the content in the columns
 replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0) 
 # Col index of where duplicate value is, 0 or 1
 col = (ixs_drop % 2) ^ 1
 # 2d array to index and replace values in the df
 # index to obtain values with which to replace
 t = np.tile([3,5], len(col)).reshape(len(col), -1)
 ixs = np.concatenate([col[:,None], t], 1)
 # translate indices to row indices
 rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
 c = np.empty((len(col), 5), dtype=x.dtype)
 c[:,::2] = x[rows_drop[:,None], ixs]
 c[:,1::2] = x[rows_replace[:,None], [2,4]]
 # update dataframe and drop rows
 df.iloc[rows_replace, 1:] = c
 return df.drop(rows_drop)

Which fo the proposed dataframe yields the expected output:

print(df)
 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
1 DEF XYZ 456 893 9999 9999
2 AAA CCC 473 341 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555

 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555

edited 2 hours ago

answered 8 hours ago

yatu

22.6k42149

Here's a NumPy solution, which might be convenient in the case performance is relevant:

def remove_middle_dest(df):
 x = df.to_numpy()
 # obtain a flat numpy array from both columns
 b = x[:,0:2].ravel()
 _, ix, inv = np.unique(b, return_index=True, return_inverse=True)
 # Index of duplicate values in b
 ixs_drop = np.setdiff1d(np.arange(len(b)), ix) 
 # Indices to be used to replace the content in the columns
 replace_at = (inv[:,None] == inv[ixs_drop]).argmax(0) 
 # Col index of where duplicate value is, 0 or 1
 col = (ixs_drop % 2) ^ 1
 # 2d array to index and replace values in the df
 # index to obtain values with which to replace
 t = np.tile([3,5], len(col)).reshape(len(col), -1)
 ixs = np.concatenate([col[:,None], t], 1)
 # translate indices to row indices
 rows_drop, rows_replace = (ixs_drop // 2), (replace_at // 2)
 c = np.empty((len(col), 5), dtype=x.dtype)
 c[:,::2] = x[rows_drop[:,None], ixs]
 c[:,1::2] = x[rows_replace[:,None], [2,4]]
 # update dataframe and drop rows
 df.iloc[rows_replace, 1:] = c
 return df.drop(rows_drop)

Which fo the proposed dataframe yields the expected output:

print(df)
 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
1 DEF XYZ 456 893 9999 9999
2 AAA CCC 473 341 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 893 8000 9999
2 AAA CCC 473 341 5555 5555

 flightTo flightFrom toNum fromNum toCode fromCode
0 ABC DEF 123 456 8000 8000
1 XYZ DEF 893 456 9999 9999
2 AAA BBB 473 917 5555 5555
3 BBB CCC 917 341 5555 5555

remove_middle_dest(df)

flightTo flightFrom toNum fromNum toCode fromCode
0 ABC XYZ 123 456 8000 9999
2 AAA CCC 473 341 5555 5555

edited 2 hours ago

answered 8 hours ago

yatu

22.6k42149

edited 2 hours ago

answered 8 hours ago

yatu

22.6k42149

answered 8 hours ago

yatu

22.6k42149

answered 8 hours ago

yatu

22.6k42149

Would this generalize to the case where the flights are randomly distributed over the dataframe?

– Erfan
8 hours ago

Yes, return_index from np.unique gives the index where a value is repeated. Regardless of its position. @erfan

– yatu
8 hours ago

I think the only problem is //2

– WeNYoBen
7 hours ago

Thanks for pointing out @WeNYoBen, updated the answer!

– yatu
6 hours ago

@yatu I updated the question with a potential edge case.

– MaxB
4 hours ago

|
show 1 more comment

Would this generalize to the case where the flights are randomly distributed over the dataframe?

– Erfan
8 hours ago

Yes, return_index from np.unique gives the index where a value is repeated. Regardless of its position. @erfan

– yatu
8 hours ago

I think the only problem is //2

– WeNYoBen
7 hours ago

Thanks for pointing out @WeNYoBen, updated the answer!

– yatu
6 hours ago

@yatu I updated the question with a potential edge case.

– MaxB
4 hours ago

Would this generalize to the case where the flights are randomly distributed over the dataframe?

– Erfan
8 hours ago

Yes, return_index from np.unique gives the index where a value is repeated. Regardless of its position. @erfan

– yatu
8 hours ago

I think the only problem is //2

– WeNYoBen
7 hours ago

Thanks for pointing out @WeNYoBen, updated the answer!

– yatu
6 hours ago

@yatu I updated the question with a potential edge case.

– MaxB
4 hours ago

|
show 1 more comment

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Mfcttrf

2 Answers
2

Your Answer

Post as a guest

2 Answers
2

2 Answers
2

Post as a guest

Popular posts from this blog

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

2 Answers 2

2 Answers 2

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

2 Answers
2

2 Answers
2

2 Answers
2