Exploring Datasets with Pandas and Python, Part 2

Accessing DataFrame Elements

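Because a DataFrame is made up of Series objects, you can use the very same tools to access its elements. The key difference is that a DataFrame has an additional dimension. Use the indexing operator for columns and the .loc and .iloc access methods for rows.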

Using the Indexing Operator

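If you think of a DataFrame as a dictionary whose values are Series, then it makes sense that you can access its columns with the indexing operator: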

>>> city_data["revenue"]
Amsterdam    4200
Tokyo        6500
Toronto      8000
Name: revenue, dtype: int64
>>> type(city_data["revenue"])
pandas.core.series.Series
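The dictionary analogy can be taken a little further. The sketch below assumes a city_data frame rebuilt by hand from the values printed in this article (how the original frame was constructed is not shown here), and uses only the dictionary-style methods that DataFrame itself provides:

```python
import pandas as pd

# Assumption: city_data rebuilt from the values printed in this article.
city_data = pd.DataFrame(
    {"revenue": [4200, 6500, 8000], "employee_count": [5.0, 8.0, None]},
    index=["Amsterdam", "Tokyo", "Toronto"],
)

# keys() returns the column labels, like the keys of a dictionary.
column_names = list(city_data.keys())

# Membership tests check the column labels, not the row labels.
has_revenue = "revenue" in city_data
has_tokyo = "Tokyo" in city_data

# The indexing operator returns the column as a Series.
revenue = city_data["revenue"]

# items() yields (column name, Series) pairs, like dict.items().
column_types = {name: type(column).__name__ for name, column in city_data.items()}
```

Note that "Tokyo" is a row label, so the membership test on the frame itself is False for it.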

If the column name is a string, you can also use attribute-style access with dot notation:

>>> city_data.revenue
Amsterdam    4200
Tokyo        6500
Toronto      8000
Name: revenue, dtype: int64

>>> toys = pd.DataFrame([
...     {"name": "ball", "shape": "sphere"},
...     {"name": "Rubik's cube", "shape": "cube"}
... ])
>>> toys["shape"]
0    sphere
1      cube
Name: shape, dtype: object
>>> toys.shape
(2, 2)

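Dot notation does not work when a column name coincides with a DataFrame attribute or method name. In the toys example above, toys.shape returns the dimensions of the DataFrame, (2, 2), rather than the "shape" column. For this reason, avoid attribute-style access in production code and whenever you manipulate data, for example when defining a new column.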
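To make the warning about defining new columns concrete, here is a minimal sketch. The toys frame is the one from this article; the price values are hypothetical and only serve as illustration. Assigning through dot notation to a name that is not yet a column does not create a column, and pandas emits a UserWarning, which the sketch silences:

```python
import warnings

import pandas as pd

toys = pd.DataFrame([
    {"name": "ball", "shape": "sphere"},
    {"name": "Rubik's cube", "shape": "cube"},
])

# The indexing operator always returns the column, whatever its name.
shape_column = toys["shape"]

# Dot notation finds the built-in DataFrame.shape attribute first.
frame_shape = toys.shape

# Assigning through dot notation to a name that is not yet a column only
# sets a plain Python attribute; pandas warns and adds no column.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    toys.price = [1.5, 9.0]  # hypothetical values
price_is_column_after_dot = "price" in toys.columns

# The indexing operator creates the column as intended.
toys["price"] = [1.5, 9.0]
price_is_column_after_brackets = "price" in toys.columns
```

Only the bracket assignment changes the set of columns, which is why the indexing operator is the safer habit.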

Using .loc and .iloc

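Like a Series, a DataFrame also provides the .loc and .iloc data access methods. Remember that .loc uses labels and .iloc uses positional indexes: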

>>> city_data.loc["Amsterdam"]
revenue           4200.0
employee_count       5.0
Name: Amsterdam, dtype: float64
>>> city_data.loc["Tokyo": "Toronto"]
         revenue  employee_count
Tokyo       6500             8.0
Toronto     8000             NaN
>>> city_data.iloc[1]
revenue           6500.0
employee_count       8.0
Name: Tokyo, dtype: float64

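For a DataFrame, the data access methods .loc and .iloc also accept a second parameter. While the first parameter selects rows based on the index, the second parameter selects columns. You can use these parameters together to select a subset of rows and columns from your DataFrame. Separate the parameters with a comma: the part before the comma selects rows, and the part after it selects columns.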
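As an illustration of the two-parameter form, the following sketch rebuilds city_data by hand from the values printed earlier (an assumption, since the construction of the original frame is not shown here) and selects the same cells once by label and once by position:

```python
import pandas as pd

# Assumption: city_data rebuilt from the values printed in this article.
city_data = pd.DataFrame(
    {"revenue": [4200, 6500, 8000], "employee_count": [5.0, 8.0, None]},
    index=["Amsterdam", "Tokyo", "Toronto"],
)

# .loc: row labels before the comma, column label after it.
# Label slices include both endpoints.
by_label = city_data.loc["Amsterdam":"Tokyo", "revenue"]

# .iloc: row positions before the comma, column position after it.
# Positional slices exclude the stop position, as in plain Python.
by_position = city_data.iloc[0:2, 0]

# A list of column labels returns a DataFrame instead of a Series.
subset = city_data.loc[["Tokyo", "Toronto"], ["revenue", "employee_count"]]
```

Both by_label and by_position hold the revenue of Amsterdam and Tokyo. The difference in slice endpoints between .loc and .iloc is the detail most likely to cause off-by-one surprises.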

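Now it's time to see the same construct in action with the bigger nba dataset. Select all games between the labels 5555 and 5559. You're only interested in the names of the teams and the scores, so select those columns as well. One solution: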

>>> nba.loc[5555:5559, ["fran_id", "opp_fran", "pts", "opp_pts"]]

[Figure: result of the query above, showing team names and scores for the games labeled 5555 to 5559]

Querying Your Dataset

>>> current_decade = nba[nba["year_id"] > 2010]
>>> current_decade.shape
(12658, 23)
>>> games_with_notes = nba[nba["notes"].notnull()]  # equivalent: nba[nba["notes"].notna()]
>>> games_with_notes.shape
(5424, 23)
>>> ers = nba[nba["fran_id"].str.endswith("ers")]
>>> ers.shape
(27797, 23)
>>> nba[
...     (nba["_iscopy"] == 0) &
...     (nba["pts"] > 100) &
...     (nba["opp_pts"] > 100) &
...     (nba["team_id"] == "BLB")
... ]
>>> nba[
...     (nba["_iscopy"] == 0) &
...     (nba["team_id"].str.startswith("LA")) &
...     (nba["year_id"] == 1992) &
...     (nba["notes"].notnull())
... ]
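The queries above all follow one pattern: each comparison produces a Boolean Series, and the Series are combined with & before being passed to the indexing operator. Here is a small self-contained sketch of that pattern. The games frame and its values are made up for illustration and are not taken from the nba dataset:

```python
import pandas as pd

# Hypothetical miniature dataset, used only to illustrate the pattern.
games = pd.DataFrame({
    "team_id": ["BLB", "LAL", "BLB", "LAC"],
    "pts": [114, 98, 90, 120],
    "opp_pts": [111, 101, 88, 119],
    "notes": [None, "at Anaheim", None, None],
})

# Each comparison yields a Boolean Series aligned with the rows.
scored_over_100 = games["pts"] > 100

# Combine conditions with & (and) or | (or). The parentheses are required
# because & binds more tightly than comparison operators such as >.
high_scoring = games[(games["pts"] > 100) & (games["opp_pts"] > 100)]

# String methods and null checks also produce Boolean Series.
la_with_notes = games[
    games["team_id"].str.startswith("LA") & games["notes"].notna()
]
```

Python's and and or keywords cannot be used here, since they try to reduce a whole Series to a single truth value and raise an error.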


Grouping and Aggregating Your Data

A Series has more than twenty different methods for calculating descriptive statistics. Here are some examples:

>>> city_revenues.sum()
18700
>>> city_revenues.max()
8000
>>> points = nba["pts"]
>>> type(points)
<class 'pandas.core.series.Series'>
>>> points.sum()
12976235
>>> nba.groupby("fran_id", sort=False)["pts"].sum()
fran_id
Huskies           3995
Knicks          582497
Stags            20398
Falcons           3797
Capitols         22387
>>> nba[
...     (nba["fran_id"] == "Spurs") &
...     (nba["year_id"] > 2010)
... ].groupby(["year_id", "game_result"])["game_id"].count()
year_id  game_result
2011     L              25
         W              63
2012     L              20
         W              60
2013     L              30
         W              73
2014     L              27
         W              78
2015     L              31
         W              58
Name: game_id, dtype: int64
>>> nba[
...     (nba["fran_id"] == "Warriors") &
...     (nba["year_id"] == 2015)
... ].groupby(["is_playoffs", "game_result"])["game_id"].count()
is_playoffs  game_result
0            L              15
             W              67
1            L               5
             W              16

By default, Pandas sorts the group keys during the call to .groupby(). If you don't want to sort, pass sort=False. This parameter can lead to performance gains.
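The effect of sort=False is easiest to see on a tiny example. The frame below is made up for illustration. With the default setting the group keys come back in sorted order, and with sort=False they come back in the order in which each key first appears:

```python
import pandas as pd

# Hypothetical data, used only to show the ordering of the group keys.
games = pd.DataFrame({
    "fran_id": ["Knicks", "Huskies", "Knicks", "Celtics", "Huskies"],
    "pts": [68, 66, 47, 55, 63],
})

# Default: group keys are sorted alphabetically.
sorted_totals = games.groupby("fran_id")["pts"].sum()

# sort=False: group keys keep the order of first appearance.
unsorted_totals = games.groupby("fran_id", sort=False)["pts"].sum()
```

The totals are identical in both cases; only the order of the resulting index differs.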

Manipulating Columns

>>> df = nba.copy()
>>> df.shape
(126314, 23)
>>> df["difference"] = df.pts - df.opp_pts
>>> df.shape
(126314, 24)
>>> df["difference"].max()
68
>>> renamed_df = df.rename(
...     columns={"game_result": "result", "game_location": "location"}
... )
>>> renamed_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 24 columns):
gameorder      126314 non-null int64
...
location       126314 non-null object
result         126314 non-null object
forecast       126314 non-null float64
notes          5424 non-null object
difference     126314 non-null int64
dtypes: float64(6), int64(8), object(10)
memory usage: 23.1+ MB
>>> df.shape
(126314, 24)
>>> elo_columns = ["elo_i", "elo_n", "opp_elo_i", "opp_elo_n"]
>>> df.drop(elo_columns, inplace=True, axis=1)
>>> df.shape
(126314, 20)
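The session above mixes operations that return a new DataFrame (.rename()) with one that modifies df in place (.drop() with inplace=True). As a rough sketch of the difference, here is the same sequence on a small made-up frame, using the non-mutating form of .drop() with the columns keyword as an alternative to axis=1:

```python
import pandas as pd

# Hypothetical miniature frame with a few of the nba column names.
df = pd.DataFrame({
    "pts": [66, 68],
    "opp_pts": [68, 66],
    "game_result": ["L", "W"],
    "elo_i": [1300.0, 1300.0],
})

# Defining a new column with the indexing operator modifies df itself.
df["difference"] = df["pts"] - df["opp_pts"]

# rename() returns a new DataFrame and leaves df unchanged.
renamed = df.rename(columns={"game_result": "result"})

# drop() without inplace=True also returns a new DataFrame.
trimmed = df.drop(columns=["elo_i"])
```

Keeping the results in separate names, as here, leaves the intermediate frames available for inspection, at the cost of some extra memory.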

Changing Data Types

>>> df.info()
>>> df["date_game"] = pd.to_datetime(df["date_game"])
>>> df["game_location"].nunique()
>>> df["game_location"].value_counts()
A    63138
H    63138
N       38
>>> df["game_location"] = pd.Categorical(df["game_location"])
>>> df["game_location"].dtype
CategoricalDtype(categories=['A', 'H', 'N'], ordered=False)

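Categorical data has a few advantages over unstructured text. When you specify the categorical data type, you make validation easier and save a lot of memory, because Pandas only stores the unique values internally. The higher the ratio of total values to unique values, the more space you save.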
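One way to check the memory claim is to compare memory_usage(deep=True) before and after the conversion. The sketch below uses a made-up Series with three unique values repeated many times; the exact byte counts depend on the pandas version and are therefore not shown:

```python
import pandas as pd

# Hypothetical column: many rows, only three distinct values.
locations = pd.Series(["A", "H"] * 5000 + ["N"] * 10)

# Convert the text column to the categorical data type.
as_category = locations.astype("category")

# deep=True counts the memory used by the string values themselves.
text_bytes = locations.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

# The categorical version stores each unique value once, plus small
# integer codes that point at those values.
categories = list(as_category.cat.categories)
```

Comparing text_bytes with category_bytes on your own data gives a quick estimate of what the conversion is worth.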

Reposted from: https://www.jianshu.com/p/6565e8870c32
