TL;DR version:
For the simple case of:
- I have a text column with a delimiter and I want two columns
The simplest solution is:
df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)
You must use expand=True if your strings have a non-uniform number of splits and you want None to replace the missing values.
Notice how, in either case, the .tolist() method is not necessary. Neither is zip().
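For instance, a minimal runnable sketch of that one-liner, assuming a space-delimited column named 'AB' (the frame and column names here are just for illustration):
>>> import pandas as pd
>>> df = pd.DataFrame({'AB': ['A1 B1', 'A2 B2']})
>>> df[['A', 'B']] = df['AB'].str.split(' ', n=1, expand=True)
>>> df
      AB   A   B
0  A1 B1  A1  B1
1  A2 B2  A2  B2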
In detail:
Andy Hayden's solution is most excellent in demonstrating the power of the str.extract() method.
But for a simple split over a known separator (like splitting by dashes, or splitting by whitespace), the .str.split() method is enough¹. It operates on a column (Series) of strings, and returns a column (Series) of lists:
>>> import pandas as pd
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2']})
>>> df
      AB
0  A1-B1
1  A2-B2
>>> df['AB_split'] = df['AB'].str.split('-')
>>> df
      AB  AB_split
0  A1-B1  [A1, B1]
1  A2-B2  [A2, B2]
¹ If you're unsure what the first two parameters of .str.split() do, I recommend the docs for the plain Python version of the method.
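For reference, the first parameter is the separator and the second (n in pandas, maxsplit in plain Python) caps how many splits are performed; a quick illustration with a plain Python string:
>>> 'A1-B1-C1'.split('-')     # split on every dash
['A1', 'B1', 'C1']
>>> 'A1-B1-C1'.split('-', 1)  # at most one split
['A1', 'B1-C1']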
But how do you go from:
- a column containing two-element lists
to:
- two columns, each containing the respective element of the lists?
Well, we need to take a closer look at the .str attribute of a column.
It's a magical object that collects methods which treat each element in a column as a string, and then applies the respective method to each element as efficiently as possible:
>>> upper_lower_df = pd.DataFrame({"U": ["A", "B", "C"]})
>>> upper_lower_df
   U
0  A
1  B
2  C
>>> upper_lower_df["L"] = upper_lower_df["U"].str.lower()
>>> upper_lower_df
   U  L
0  A  a
1  B  b
2  C  c
But it also has an "indexing" interface for getting each element of a string by its index:
>>> df['AB'].str[0]
0    A
1    A
Name: AB, dtype: object
>>> df['AB'].str[1]
0    1
1    2
Name: AB, dtype: object
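If you prefer a named method over the bracket syntax, .str.get() is an equivalent spelling of the same element access:
>>> df['AB'].str.get(0)
0    A
1    A
Name: AB, dtype: object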
Of course, this indexing interface of .str doesn't really care if each element it's indexing is actually a string, as long as it can be indexed, so:
>>> df['AB'].str.split('-', n=1).str[0]
0    A1
1    A2
Name: AB, dtype: object
>>> df['AB'].str.split('-', n=1).str[1]
0    B1
1    B2
Name: AB, dtype: object
Then, it's a simple matter of taking advantage of the Python tuple unpacking of iterables to do
>>> df['A'], df['B'] = df['AB'].str.split('-', n=1).str
>>> df
      AB  AB_split   A   B
0  A1-B1  [A1, B1]  A1  B1
1  A2-B2  [A2, B2]  A2  B2
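If unpacking the .str accessor directly doesn't work in your pandas version (it relies on the accessor being iterable, which newer releases may not support), the same two columns can be built with the indexing interface shown above:
>>> df['A'] = df['AB'].str.split('-', n=1).str[0]
>>> df['B'] = df['AB'].str.split('-', n=1).str[1]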
Of course, getting a DataFrame out of splitting a column of strings is so useful that the .str.split() method can do it for you with the expand=True parameter:
>>> df['AB'].str.split('-', n=1, expand=True)
    0   1
0  A1  B1
1  A2  B2
So, another way of accomplishing what we wanted is to do:
>>> df = df[['AB']]
>>> df
      AB
0  A1-B1
1  A2-B2
>>> df.join(df['AB'].str.split('-', n=1, expand=True).rename(columns={0:'A', 1:'B'}))
      AB   A   B
0  A1-B1  A1  B1
1  A2-B2  A2  B2
The expand=True version, although longer, has a distinct advantage over the tuple unpacking method. Tuple unpacking doesn't deal well with splits of different lengths:
>>> df = pd.DataFrame({'AB': ['A1-B1', 'A2-B2', 'A3-B3-C3']})
>>> df
         AB
0     A1-B1
1     A2-B2
2  A3-B3-C3
>>> df['A'], df['B'], df['C'] = df['AB'].str.split('-')
Traceback (most recent call last):
[...]
ValueError: Length of values does not match length of index
>>>
But expand=True handles it nicely by placing None in the columns for which there aren't enough "splits":
>>> df.join(
... df['AB'].str.split('-', expand=True).rename(
... columns={0:'A', 1:'B', 2:'C'}
... )
... )
         AB   A   B     C
0     A1-B1  A1  B1  None
1     A2-B2  A2  B2  None
2  A3-B3-C3  A3  B3    C3
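Combining this with the assignment form from the TL;DR, the same three columns could also be added in place with something like the following (a sketch; it relies on the split producing exactly three columns, which are matched to the new names by position):
>>> df[['A', 'B', 'C']] = df['AB'].str.split('-', expand=True)  # missing pieces stay None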