2
votes

I have a dataframe . Many column names have non ASCII characters and special characters like (), /, +, . (non ascii dots in the middle ) etc and non ascii spaces. This did not happen while reading csv. This happened due to one - hot encoding. (when i converted my categorical variable to numeric columns & category values had non ascii values)

df

Col1/name   Col 2() name    Col3 + name    Col4 ^ยจ name   etc...

Expected output

I want only numbers, underscores and characters in my column names (I only want to change column names not any value in dataframe or rows). This is necessary because some Machine learning algorithms such as lightGBM dont work with non ASCII characters or non ASCII spaces in column names.

Expected output df:

Col1name   Col_2_name    Col3__name    Col4__name   etc...

So replacing space with underscores and removing any non- numeric and non-character in column names .

2
why not cleanse non-ascii chars from the column you're one-hot-encoding first? โ€“ Chris A
There are 1000 columns and 3 lakhs rows..Too much effort โ€“ noob
I only need ascii characters in my column names for Machine learning algos to work โ€“ noob

2 Answers

6
votes

One way using pandas.Series.str.replace and findall:

df.columns = ["".join(l) for l in df.columns.str.replace("\s", "_").str.findall("[\w\d]+")]
print(df)

Output:

Empty DataFrame
Columns: [Col1name, Col_2_name, Col3__name, Col4__name]
Index: []
1
votes

You can use the method replace:

df.columns.str.replace('\s+', '_').str.replace('\W+', '')

Output:

Index(['Col1name', 'Col_2_name', 'Col3__name', 'Col4__name'], dtype='object')

You can remove multiple underscores with str.replace('_{2,}', '_').