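The data for this analysis comes from "yc_data.csv", a file with detailed information on the companies funded by the Y Combinator (YC) startup accelerator.
First, we read the CSV file with pandas and look at the first few rows: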
import pandas as pd
df = pd.read_csv("yc_data.csv")
print(df.head())
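The output shows that the dataset contains 17 columns; their names and types are listed in the df.info() output further down.
Next, we look at the overall shape of the data: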
print(df.info())
print(df.isnull().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4586 entries, 0 to 4585
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   batch_idx          4586 non-null   int64
 1   company_id         4586 non-null   int64
 2   company_name       4586 non-null   object
 3   short_description  4432 non-null   object
 4   long_description   4266 non-null   object
 5   batch              4586 non-null   object
 6   status             4586 non-null   object
 7   tags               4586 non-null   object
 8   location           4324 non-null   object
 9   country            4331 non-null   object
 10  year_founded       3563 non-null   float64
 11  num_founders       4586 non-null   int64
 12  founders_names     4586 non-null   object
 13  team_size          4515 non-null   float64
 14  website            4585 non-null   object
 15  cb_url             2540 non-null   object
 16  linkedin_url       2980 non-null   object
dtypes: float64(2), int64(3), object(12)
memory usage: 609.2+ KB
None
...
website 1
cb_url 2046
linkedin_url 1606
dtype: int64
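The output shows that the dataset has 4,586 rows and that several columns have missing values, including short_description, long_description, location, country, and year_founded.
To prepare for the later analysis, we need to clean and preprocess the data.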
import numpy as np

# Handle missing values
df['short_description'] = df['short_description'].fillna('No description')
df['year_founded'] = df['year_founded'].fillna(df['year_founded'].median())
df['team_size'] = df['team_size'].fillna(df['team_size'].median())

# Flag whether a company counts as successful (assumes Acquired or Active status means success)
df['is_successful'] = df['status'].isin(['Acquired', 'Active'])

# Extract the year from the batch column, returning NaN for malformed values
def extract_year(batch):
    try:
        year = batch[-2:]  # last two characters of the batch label
        return int('20' + year)  # e.g. '21' -> 2021
    except (TypeError, ValueError):
        return np.nan

df['batch_year'] = df['batch'].apply(extract_year)

# Inspect the unique batch_year values to check for remaining problems
print(df['batch_year'].unique())
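Median imputation hides which rows were originally missing, which matters for a column like year_founded where over a thousand values are absent. A minimal sketch of one way to keep that information, using a made-up four-row frame and a hypothetical team_size_missing column that is not part of the original analysis:

```python
import numpy as np
import pandas as pd

# Made-up sample standing in for the real DataFrame.
sample = pd.DataFrame({"team_size": [2.0, np.nan, 10.0, 4.0]})

# Record which rows were missing BEFORE filling them in.
# 'team_size_missing' is a hypothetical column name chosen for illustration.
sample["team_size_missing"] = sample["team_size"].isna()

# Fill the gaps with the median of the observed values, as in the cleaning step.
sample["team_size"] = sample["team_size"].fillna(sample["team_size"].median())

print(sample)
```

The indicator column can then be passed to a model alongside the imputed value, so that "missing" is not silently treated as "typical".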
Now that the data is cleaned, let's explore some interesting insights:
import matplotlib.pyplot as plt

status_counts = df['status'].value_counts()
plt.figure(figsize=(10, 6))
status_counts.plot(kind='bar')
plt.title('Distribution of Company Statuses')
plt.xlabel('Status')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
This code produces a bar chart showing the distribution of company statuses.
batch_counts = df['batch'].value_counts().sort_index()
plt.figure(figsize=(12, 6))
batch_counts.plot(kind='line')
plt.title('Number of Companies per Batch')
plt.xlabel('Batch')
plt.ylabel('Number of Companies')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
This produces a line chart showing how the number of companies changes from batch to batch.
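One caveat with sort_index() on batch labels: strings sort alphabetically, so every label starting with "S" would come before every label starting with "W", regardless of year. The sketch below shows one possible chronological sort key. It assumes labels look like "W21" or "S05" (a season prefix followed by a two-digit year) and that winter precedes summer within a year; both the SEASON_ORDER mapping and the batch_sort_key helper are hypothetical, and the sample labels are made up.

```python
from collections import Counter

# Assumed ordering of season prefixes within a year (W = winter, S = summer).
# Any other prefix sorts after these within its year.
SEASON_ORDER = {"W": 0, "S": 1}

def batch_sort_key(batch):
    """Return (year, season_rank) for a label such as 'W21' or 'S05'.

    Raises ValueError if the label does not end in two digits.
    """
    year = 2000 + int(batch[-2:])
    season_rank = SEASON_ORDER.get(batch[:-2], len(SEASON_ORDER))
    return (year, season_rank)

# Made-up sample standing in for df['batch'].
sample_batches = ["S05", "W06", "S06", "W06", "W21", "S05", "S20"]
counts = Counter(sample_batches)

# Order the distinct labels chronologically instead of alphabetically.
ordered = sorted(counts, key=batch_sort_key)
print(ordered)                       # ['S05', 'W06', 'S06', 'S20', 'W21']
print([counts[b] for b in ordered])  # [2, 2, 1, 1, 1]
```

With the real data, the same key could be used to reorder the value counts before plotting, for example by passing sorted(batch_counts.index, key=batch_sort_key) to batch_counts.reindex, provided every label ends in a two-digit year.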
all_tags = [tag for tags in df['tags'] for tag in tags]
tag_counts = pd.Series(all_tags).value_counts().head(20)
plt.figure(figsize=(12, 6))
tag_counts.plot(kind='bar')
plt.title('Top 20 Most Common Tags')
plt.xlabel('Tag')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
This chart shows the 20 most common tags.
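The list comprehension above only yields whole tags if each cell of the tags column holds a real Python list. After read_csv the column has dtype object, and if the cells are strings, iterating over them would yield single characters instead. The sketch below assumes (this is not confirmed by the source) that each cell is a stringified list such as "['B2B', 'SaaS']"; parse_tags is a hypothetical helper and the sample values are made up.

```python
import ast
from collections import Counter

def parse_tags(raw):
    """Turn a stringified list such as "['B2B', 'SaaS']" into a real list.

    Values that are already lists are returned unchanged; anything that
    cannot be parsed as a list or tuple yields an empty list.
    """
    if isinstance(raw, list):
        return raw
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []

# Made-up sample standing in for df['tags'].
raw_tags = ["['B2B', 'SaaS']", "['SaaS']", "[]", "not a list"]

tag_counter = Counter(tag for raw in raw_tags for tag in parse_tags(raw))
print(tag_counter.most_common(2))  # [('SaaS', 2), ('B2B', 1)]
```

If the real column uses a different encoding, for example comma-separated text, the parsing step would need to change accordingly.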
df['success_rate'] = df.groupby('batch_year')['is_successful'].transform('mean')
plt.figure(figsize=(12, 6))
df.groupby('batch_year')['success_rate'].mean().plot(kind='line')
plt.title('Success Rate Over Time')
plt.xlabel('Year')
plt.ylabel('Success Rate')
plt.tight_layout()
plt.show()
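The chart shows that the success rate of YC startups trends upward overall and has stayed high in recent years. Part of this rise may simply reflect that recent batches have had less time to fail, so they are more likely to still be Active.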
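Next, we use t-tests to examine how different factors relate to success. We first define a function that runs a t-test on a given variable, then apply it to each numeric variable:
from scipy import stats

def perform_t_test(variable):
    # Split the variable by outcome and compare the two group means
    successful_values = df[df['is_successful']][variable]
    unsuccessful_values = df[~df['is_successful']][variable]
    t_stat, p_value = stats.ttest_ind(successful_values, unsuccessful_values)
    print(f"Variable: {variable}")
    print(f"T-statistic: {t_stat}")
    print(f"P-value: {p_value}")
    print("---")

for variable in ['year_founded', 'num_founders', 'team_size', 'batch_year']:
    perform_t_test(variable)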
Variable: year_founded
T-statistic: 4.2584208077988706
P-value: 2.0999812247726262e-05
---
Variable: num_founders
T-statistic: 3.5994256038457904
P-value: 0.0003222920811796079
---
Variable: team_size
T-statistic: 0.2147248161445528
P-value: 0.8299914248081315
---
Variable: batch_year
T-statistic: 27.695299446266723
P-value: 3.067399233387115e-156
---
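The output shows that year_founded, num_founders, and batch_year differ significantly between successful and unsuccessful companies (p < 0.001 in each case), while team_size does not (p ≈ 0.83).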
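With several thousand rows, even a small difference between groups can produce a very small p-value, so an effect size is a useful companion to the t-test. The sketch below computes Cohen's d with a pooled standard deviation. It is an addition for illustration rather than part of the original analysis: cohens_d is a hypothetical helper and the sample values are made up.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference between two samples (pooled SD)."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Made-up values: both groups have variance 1 and means that differ by 1.
d = cohens_d([2, 3, 4], [1, 2, 3])
print(d)  # 1.0
```

On the real data, the two arguments would be the successful and unsuccessful values of a variable, the same split used inside perform_t_test.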
Finally, we try a random forest model to predict whether a company is successful:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
features = ['year_founded', 'num_founders', 'team_size']
X = df[features]
y = df['is_successful']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
feature_importance = pd.DataFrame({'feature': features, 'importance': rf_model.feature_importances_})
feature_importance = feature_importance.sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(feature_importance)
Accuracy: 0.8540305010893247

Classification Report:
              precision    recall  f1-score   support

       False       0.61      0.51      0.56       164
        True       0.90      0.93      0.91       754

    accuracy                           0.85       918
   macro avg       0.75      0.72      0.73       918
weighted avg       0.85      0.85      0.85       918

Feature Importance:
        feature  importance
2     team_size    0.562311
0  year_founded    0.368118
1  num_founders    0.069571
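The random forest reaches 85.4% accuracy on the test set. Because 754 of the 918 test companies (about 82%) are successful, this is only modestly better than always predicting success, and recall on the unsuccessful class is just 0.51. In terms of feature importance, team size contributes the most (0.56), followed by founding year (0.37) and number of founders (0.07).
From this analysis of the YC startup data, we draw the following main conclusions: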
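Accuracy on an imbalanced test set is easier to judge next to a majority-class baseline. The sketch below computes that baseline from the support counts in the classification report above (164 unsuccessful, 754 successful). The function name majority_baseline_accuracy is hypothetical, and the label list is rebuilt from those counts rather than taken from the real y_test.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Labels reconstructed from the reported support column: 754 True, 164 False.
y_test_like = [True] * 754 + [False] * 164

baseline = majority_baseline_accuracy(y_test_like)
print(f"Majority-class baseline: {baseline:.3f}")  # 0.821
print(f"Model accuracy:          {0.8540305010893247:.3f}")  # 0.854
```

With the real split, majority_baseline_accuracy(list(y_test)) would give the same comparison directly.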