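Data Cleaning
You may be surprised to find this section so late in the tutorial! Normally, you would take a critical look at your dataset and fix any problems before moving on to more sophisticated analysis. In this tutorial, however, you'll rely on the techniques you learned in the previous sections to clean your dataset.
Missing Values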

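When you inspect the nba dataset with nba.info(), you'll see that it's quite neat. Only the notes column contains null values, and it does so in the majority of its rows.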
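To put numbers on what nba.info() reports, a minimal sketch along the following lines counts the nulls in each column. The small games DataFrame is a hypothetical stand-in for nba with made-up values; only isna(), sum(), and mean() are actual pandas calls.

```python
import pandas as pd

# Hypothetical stand-in for the nba dataset (values are made up).
games = pd.DataFrame(
    {
        "pts": [66, 68, 63, 47],
        "game_result": ["L", "W", "W", "L"],
        "notes": [None, None, "at Chicago", None],
    }
)

# Number of missing values in each column.
missing_counts = games.isna().sum()

# Fraction of rows that are missing, per column.
missing_share = games.isna().mean()

print(missing_counts)
print(missing_share["notes"])
```

A column whose missing share is close to 1, like notes here, is a candidate for dropping or for filling with a default value.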
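>>> # Drop every row that has at least one missing value
>>> rows_without_missing_data = nba.dropna()
>>> rows_without_missing_data.shape
(5424, 23)

>>> # Drop every column that has at least one missing value
>>> data_without_missing_columns = nba.dropna(axis=1)
>>> data_without_missing_columns.shape
(126314, 22)

>>> # Replace missing notes with a default value, working on a copy.
>>> # The result is assigned back to the column, because calling
>>> # fillna(inplace=True) on a selected column may not update the
>>> # DataFrame in recent pandas versions.
>>> data_with_default_notes = nba.copy()
>>> data_with_default_notes["notes"] = data_with_default_notes["notes"].fillna(
...     value="no notes at all"
... )
>>> data_with_default_notes["notes"].describe()
count              126314
unique                232
top       no notes at all
freq               120890
Name: notes, dtype: object

>>> nba.describe()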

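>>> # Select games in which a team scored no points at all
>>> nba[nba["pts"] == 0]

>>> # Check that no team scored more than its opponent without winning
>>> nba[(nba["pts"] > nba["opp_pts"]) & (nba["game_result"] != 'W')].empty
True
>>> # Check that no team scored less than its opponent without losing
>>> nba[(nba["pts"] < nba["opp_pts"]) & (nba["game_result"] != 'L')].empty
True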
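The two consistency checks can also be combined into one filter that returns the offending rows rather than just a boolean. The sketch below does this on a hypothetical four-game DataFrame with made-up scores, one of which is deliberately inconsistent; the column names mirror those of nba, but the data does not come from it.

```python
import pandas as pd

# Hypothetical games; the last row is deliberately inconsistent
# (more points than the opponent, yet recorded as a loss).
games = pd.DataFrame(
    {
        "pts": [101, 95, 88, 110],
        "opp_pts": [99, 102, 88, 104],
        "game_result": ["W", "L", "W", "L"],
    }
)

won_on_points = games["pts"] > games["opp_pts"]
lost_on_points = games["pts"] < games["opp_pts"]

# Rows where the recorded result contradicts the score.
inconsistent = games[
    (won_on_points & (games["game_result"] != "W"))
    | (lost_on_points & (games["game_result"] != "L"))
]

# Tied scores are caught by neither condition, so list them separately.
ties = games[games["pts"] == games["opp_pts"]]

print(inconsistent)
print(ties)
```

An empty inconsistent frame corresponds to both .empty checks returning True.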

Combining Multiple Datasets
>>> further_city_data = pd.DataFrame(
...     {"revenue": [7000, 3400], "employee_count": [2, 2]},
...     index=["New York", "Barcelona"]
... )

>>> all_city_data = pd.concat([city_data, further_city_data], sort=False)
>>> all_city_data
           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN
New York      7000             2.0
Barcelona     3400             2.0

>>> city_countries = pd.DataFrame({
...     "country": ["Holland", "Japan", "Holland", "Canada", "Spain"],
...     "capital": [1, 1, 0, 0, 0]},
...     index=["Amsterdam", "Tokyo", "Rotterdam", "Toronto", "Barcelona"]
... )
>>> cities = pd.concat([all_city_data, city_countries], axis=1, sort=False)
>>> cities
           revenue  employee_count  country  capital
Amsterdam   4200.0             5.0  Holland      1.0
Tokyo       6500.0             8.0    Japan      1.0
Toronto     8000.0             NaN   Canada      0.0
New York    7000.0             2.0      NaN      NaN
Barcelona   3400.0             2.0    Spain      0.0
Rotterdam      NaN             NaN  Holland      0.0

>>> pd.concat([all_city_data, city_countries], axis=1, join="inner")
           revenue  employee_count  country  capital
Amsterdam     4200             5.0  Holland        1
Tokyo         6500             8.0    Japan        1
Toronto       8000             NaN   Canada        0
Barcelona     3400             2.0    Spain        0

>>> countries = pd.DataFrame({
...     "population_millions": [17, 127, 37],
...     "continent": ["Europe", "Asia", "North America"]
... }, index=["Holland", "Japan", "Canada"])

>>> pd.merge(cities, countries, left_on="country", right_index=True)

>>> pd.merge(
...     cities,
...     countries,
...     left_on="country",
...     right_index=True,
...     how="left"
... )
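The difference between the default inner merge and how="left" is easier to see on a smaller example. The following sketch uses trimmed-down, hypothetical versions of cities and countries in which one country (Spain) has no matching row; the numbers are illustrative only.

```python
import pandas as pd

# Trimmed-down, hypothetical versions of the DataFrames used above.
cities = pd.DataFrame(
    {"country": ["Holland", "Japan", "Spain"]},
    index=["Amsterdam", "Tokyo", "Barcelona"],
)
countries = pd.DataFrame(
    {"population_millions": [17, 127]},
    index=["Holland", "Japan"],
)

# Default (inner) merge: cities without a matching country are dropped.
inner = pd.merge(cities, countries, left_on="country", right_index=True)

# Left merge: every city is kept; unmatched ones get NaN.
left = pd.merge(
    cities, countries, left_on="country", right_index=True, how="left"
)

print(inner)
print(left)
```

With the inner merge, Barcelona disappears because Spain is not in countries. With how="left" it stays, and its population_millions value is NaN.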


Visualizing Data
>>> nba[nba["fran_id"] == "Knicks"].groupby("year_id")["pts"].sum().plot()

>>> nba["fran_id"].value_counts().head(10).plot(kind="bar")

>>> nba[
...     (nba["fran_id"] == "Heat") &
...     (nba["year_id"] == 2013)
... ]["game_result"].value_counts().plot(kind="pie")
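The first of these plots can be tried without the full dataset. The sketch below uses a hypothetical five-row games DataFrame with made-up points, and assumes matplotlib is installed, since pandas delegates .plot() to it. The Agg backend, the title, and the axis labels are choices made for this illustration and are not part of the original example.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required

import pandas as pd

# Hypothetical stand-in for the nba dataset (values are made up).
games = pd.DataFrame(
    {
        "fran_id": ["Knicks", "Knicks", "Knicks", "Heat", "Heat"],
        "year_id": [2012, 2012, 2013, 2013, 2013],
        "pts": [100, 90, 105, 98, 110],
    }
)

# Total points per season for one franchise.
knicks_pts_by_year = (
    games[games["fran_id"] == "Knicks"].groupby("year_id")["pts"].sum()
)

# .plot() returns a matplotlib Axes, which can be labeled further.
ax = knicks_pts_by_year.plot(kind="line", title="Knicks points per season")
ax.set_xlabel("Season")
ax.set_ylabel("Total points")

print(knicks_pts_by_year)
```

In a script, calling ax.figure.savefig with a file name writes the chart to disk. In a Jupyter notebook the plot is normally rendered inline.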



https://www.somebits.com/~nelson/pandas-multiindex-slice-demo.html
Reposted from: https://www.jianshu.com/p/40bb55b5a405