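Kappa coefficient
The Kappa coefficient is a model evaluation metric computed from the confusion matrix. It is defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between predictions and true labels (the accuracy) and p_e is the agreement expected by chance.
The coefficient ranges from -1 to 1. A value of 1 means perfect agreement, 0 means the model does no better than random guessing, and a value below 0 means it does worse than chance.
In Python:
from sklearn.metrics import cohen_kappa_score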
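The original code cell is cut off after the import, so here is a minimal pure-Python sketch of how kappa can be computed by hand. The function name `cohen_kappa_manual` and the toy labels are made up for illustration; with scikit-learn installed, `cohen_kappa_score(y_true, y_pred)` should return the same value.

```python
from collections import Counter


def cohen_kappa_manual(y_true, y_pred):
    """Hypothetical helper: kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # p_o: observed agreement, i.e. plain accuracy
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # p_e: agreement expected by chance, from the two label distributions
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Note: undefined (division by zero) when p_e == 1, e.g. a single class
    return (p_o - p_e) / (1 - p_e)


# Toy labels, invented for this example
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]

kappa = cohen_kappa_manual(y_true, y_pred)
print(round(kappa, 3))  # 0.5

# Optional cross-check, if scikit-learn is available:
# from sklearn.metrics import cohen_kappa_score
# print(cohen_kappa_score(y_true, y_pred))
```

Here the accuracy is 4/6, but a third of the agreement is expected by chance, which is why kappa comes out lower than accuracy.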
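Hamming distance
Hamming distance also applies to multiclass problems. Put simply, it measures the distance between the predicted labels and the true labels as the fraction of labels predicted incorrectly, so it ranges from 0 to 1. A distance of 0 means the predictions match the true labels exactly; a distance of 1 means every label is wrong. I'll skip the formula (forgive my laziness) and go straight to the Python example.
from sklearn.metrics import hamming_loss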
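Since only the import survives from the original cell, the following is a small pure-Python sketch of the idea: count the label positions where prediction and truth disagree and divide by the total number of positions. The helper name `hamming_loss_manual` and the data are assumptions for illustration; scikit-learn's `hamming_loss(y_true, y_pred)` should agree on the same inputs.

```python
def hamming_loss_manual(y_true, y_pred):
    """Hypothetical helper: fraction of label positions that differ."""
    total = 0
    wrong = 0
    for t, p in zip(y_true, y_pred):
        # Treat a single multiclass label as a one-element label vector
        t_vec = t if isinstance(t, (list, tuple)) else [t]
        p_vec = p if isinstance(p, (list, tuple)) else [p]
        total += len(t_vec)
        wrong += sum(a != b for a, b in zip(t_vec, p_vec))
    return wrong / total


# Multiclass: one label per sample, 2 of 4 are wrong
multiclass_loss = hamming_loss_manual([0, 1, 2, 2], [0, 2, 2, 1])
print(multiclass_loss)  # 0.5

# Multilabel: one 0/1 indicator vector per sample, 2 of 6 positions are wrong
multilabel_loss = hamming_loss_manual([[1, 0, 1], [0, 1, 0]],
                                      [[1, 1, 1], [0, 1, 1]])
print(round(multilabel_loss, 3))  # 0.333
```

In the multiclass case this is simply the misclassification rate, that is, 1 minus accuracy.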
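Jaccard similarity coefficient
It differs from Hamming distance in the denominator: the Jaccard coefficient divides the number of labels shared by the prediction and the ground truth by the number of labels in their union. When the prediction matches the ground truth exactly, the coefficient is 1; when the two have nothing in common, it is 0; when the prediction is a proper subset or proper superset of the ground truth, the coefficient lies between 0 and 1.
Averaging the per-sample coefficients over all samples gives the algorithm's overall performance on the test set.
from sklearn.metrics import jaccard_similarity_score  # removed in scikit-learn 0.23; newer releases provide jaccard_score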
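As a rough illustration of the per-sample averaging described above, here is a pure-Python sketch that works on label sets. The helper name `jaccard_samples` and the example sets are invented, and scoring two empty sets as 1.0 is a convention chosen here, not something the original states.

```python
def jaccard_samples(y_true_sets, y_pred_sets):
    """Hypothetical helper: mean of |intersection| / |union| over all samples."""
    scores = []
    for truth, pred in zip(y_true_sets, y_pred_sets):
        union = truth | pred
        # Assumed convention: two empty label sets count as a perfect match
        scores.append(len(truth & pred) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


y_true_sets = [{"a", "b"}, {"c"}, {"a"}]
y_pred_sets = [{"a", "b"}, {"d"}, {"a", "b"}]

# Per-sample scores: exact match -> 1.0, disjoint -> 0.0, superset -> 0.5
score = jaccard_samples(y_true_sets, y_pred_sets)
print(score)  # 0.5
```

The third sample shows the superset case: the prediction contains the true label plus one extra, so it scores 1/2.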
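Hinge loss
Hinge loss is generally used for "maximal margin" classification, as in support vector machines. The loss is non-negative and has no upper bound: a value of 0 means every sample is classified correctly with a margin of at least 1, and larger values mean more samples fall inside the margin or on the wrong side of the decision boundary.
from sklearn.metrics import hinge_loss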
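A minimal sketch of the binary hinge loss, assuming labels encoded as +1/-1 and real-valued decision scores (for example the output of a classifier's `decision_function`). The helper name and the numbers are illustrative only; scikit-learn's `hinge_loss(y_true, pred_decision)` should give the same result on this input.

```python
def hinge_loss_manual(y_true, decisions):
    """Hypothetical helper: mean of max(0, 1 - y * f(x)) for labels in {+1, -1}."""
    losses = [max(0.0, 1.0 - y * f) for y, f in zip(y_true, decisions)]
    return sum(losses) / len(losses)


y_true = [1, -1, 1, -1]
decisions = [2.0, -0.5, 0.3, 1.5]  # made-up decision scores

# Per-sample losses: 0.0 (confident and correct), 0.5 and 0.7 (correct but
# inside the margin), 2.5 (wrong side of the boundary)
loss = hinge_loss_manual(y_true, decisions)
print(round(loss, 3))  # 0.925
```

The last sample alone contributes a loss of 2.5, which shows that the value is not capped at 1.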
Case study
import pandas as pd
df = df[pd.notnull(df['Consumer complaint narrative'])]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4569 entries, 1 to 21662
Data columns (total 18 columns):
Date received 4569 non-null object
Product 4569 non-null object
Sub-product 3106 non-null object
Issue 4569 non-null object
Sub-issue 2294 non-null object
Consumer complaint narrative 4569 non-null object
Company public response 2220 non-null object
Company 4569 non-null object
State 4556 non-null object
ZIP code 4556 non-null object
Tags 770 non-null object
Consumer consent provided? 4569 non-null object
Submitted via 4569 non-null object
Date sent to company 4569 non-null object
Company response to consumer 4569 non-null object
Timely response? 4569 non-null object
Consumer disputed? 4568 non-null object
Complaint ID 4569 non-null float64
dtypes: float64(1), object(17)
memory usage: 678.2+ KB
col = ['Product', 'Consumer complaint narrative']
df.columns
Index(['Product', 'Consumer complaint narrative'], dtype='object')
df.columns = ['Product', 'Consumer_complaint_narrative']
df['category_id'] = df['Product'].factorize()[0]
df.head()
index | Product | Consumer_complaint_narrative | category_id |
---|---|---|---|
1 | Credit reporting | I have outdated information on my credit repor... | 0 |
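For readers unfamiliar with `factorize()`, the sketch below mimics in plain Python what it does to the `Product` column: each distinct value gets an integer id in order of first appearance. The helper name `factorize_manual` and the short product list are illustrative; the `id_to_category` name is borrowed from the traceback further down, but how the original notebook built it is not shown, so this construction is an assumption.

```python
def factorize_manual(values):
    """Hypothetical helper: integer codes assigned in order of first appearance."""
    codes, uniques, seen = [], [], {}
    for value in values:
        if value not in seen:
            seen[value] = len(uniques)
            uniques.append(value)
        codes.append(seen[value])
    return codes, uniques


# A few product names in a made-up order
products = ["Credit reporting", "Consumer Loan", "Debt collection",
            "Credit reporting", "Mortgage"]

codes, uniques = factorize_manual(products)
print(codes)  # [0, 1, 2, 0, 3]

# Lookup tables between names and ids
category_to_id = {name: i for i, name in enumerate(uniques)}
id_to_category = {i: name for name, i in category_to_id.items()}
print(id_to_category[2])  # Debt collection
```

Because the ids depend on row order, they are only meaningful together with a mapping like this one.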
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
(4569, 12633)
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
print(clf.predict(count_vect.transform(["This company refuses to provide me verification and validation of debt per my right under the FDCPA. I do not believe this debt is mine."])))
['Debt collection']
print(clf.predict(count_vect.transform(["I am disputing the inaccurate information the Chex-Systems has on my credit report. I initially submitted a police report on XXXX/XXXX/16 and Chex Systems only deleted the items that I mentioned in the letter and not all the items that were actually listed on the police report. In other words they wanted me to say word for word to them what items were fraudulent. The total disregard of the police report and what accounts that it states that are fraudulent. If they just had paid a little closer attention to the police report I would not been in this position now and they would n't have to research once again. I would like the reported information to be removed : XXXX XXXX XXXX"])))
['Credit reporting']
df[df['Consumer_complaint_narrative'] == "This company refuses to provide me verification and validation of debt per my right under the FDCPA. I do not believe this debt is mine."]
index | Product | Consumer_complaint_narrative | category_id |
---|---|---|---|
12 | Debt collection | This company refuses to provide me verificatio... | 2 |
df[df['Consumer_complaint_narrative'] == "I am disputing the inaccurate information the Chex-Systems has on my credit report. I initially submitted a police report on XXXX/XXXX/16 and Chex Systems only deleted the items that I mentioned in the letter and not all the items that were actually listed on the police report. In other words they wanted me to say word for word to them what items were fraudulent. The total disregard of the police report and what accounts that it states that are fraudulent. If they just had paid a little closer attention to the police report I would not been in this position now and they would n't have to research once again. I would like the reported information to be removed : XXXX XXXX XXXX"]
index | Product | Consumer_complaint_narrative | category_id |
---|---|---|---|
61 | Credit reporting | I am disputing the inaccurate information the ... | 0 |
from sklearn.linear_model import LogisticRegression
import seaborn as sns
cv_df.groupby('model_name').accuracy.mean()
model_name
LinearSVC 0.822890
LogisticRegression 0.792927
MultinomialNB 0.688519
RandomForestClassifier 0.443826
Name: accuracy, dtype: float64
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from IPython.display import display
'Consumer Loan' predicted as 'Credit reporting' : 10 examples.
index | Product | Consumer_complaint_narrative |
---|---|---|
2720 | Consumer Loan | Quoting them, your first loan application, the... |
7091 | Consumer Loan | While reviewing my XXXX credit report, I notic... |
5439 | Consumer Loan | I have been recently checking my credit report... |
12763 | Consumer Loan | We went to buy XXXX cars, and the dealership s... |
13158 | Consumer Loan | I got a 30 day late XX/XX/2017 and it 's repor... |
4134 | Consumer Loan | I took out an instalment loan in the amount XX... |
13848 | Consumer Loan | I was turned down for a loan by Honda Finacial... |
19227 | Consumer Loan | ONEMAIN # XXXX XXXX , IN XXXX ( XXXX ) XXXX Da... |
11258 | Consumer Loan | I have not been given credit for the payments ... |
11242 | Consumer Loan | Reliable Credit falsely submitted an applicati... |
'Debt collection' predicted as 'Credit reporting' : 18 examples.
index | Product | Consumer_complaint_narrative |
---|---|---|
18410 | Debt collection | Dear CFPB, I am asking you for assistance to i... |
5262 | Debt collection | XXXX XXXX, XXXX ( This letter describes in det... |
11834 | Debt collection | XXXX XXXX XXXX is reporting negatively on my c... |
19652 | Debt collection | I recently paid of both debts on my credit acc... |
15557 | Debt collection | Never have been a XXXX XXXX customer. I was at... |
4431 | Debt collection | someone tried getting credit information and i... |
15949 | Debt collection | This debt is from account from $ XX/XX/2008 an... |
12475 | Debt collection | In XXXX XXXX, there was an account opened thro... |
13548 | Debt collection | DIVERSIFIELD CONSULTANTS INC HAVE VIOLATED FCR... |
6988 | Debt collection | Also collections refuses to stop reporting to ... |
16498 | Debt collection | They called my son and told him that they are ... |
12028 | Debt collection | Rubin & Rothman LLC ( R & R ) received default... |
7131 | Debt collection | THIS IS FRAUD. I HAVE REQUESTED VERIFICATION A... |
15630 | Debt collection | Barclays Bank Delaware obtained a judgment aga... |
11112 | Debt collection | This account was a joint account with XXXX and... |
16 | Debt collection | This complaint is in regards to Square Two Fin... |
311 | Debt collection | Hunter Warfield has be unable to provide prope... |
15988 | Debt collection | Unknown account, never have been notified and ... |
'Mortgage' predicted as 'Credit reporting' : 6 examples.
index | Product | Consumer_complaint_narrative |
---|---|---|
4637 | Mortgage | This complaint is in follow-up to Complaint # ... |
5269 | Mortgage | The attached complaint was initially written t... |
7343 | Mortgage | In 2014, I went to XXXX in order to buy a mobi... |
15048 | Mortgage | Company repeatedly corrects my credit report a... |
861 | Mortgage | Mortgage broker did Credit inquiry on my credi... |
19781 | Mortgage | I am a card carrying XXXX and wanted to see if... |
'Credit card' predicted as 'Credit reporting' : 9 examples.
index | Product | Consumer_complaint_narrative |
---|---|---|
18643 | Credit card | I was told this account wiuld be deleted from ... |
18574 | Credit card | This inquiry was n't me |
19868 | Credit card | Capital One/Kohls has been reporting a past du... |
19963 | Credit card | on XX/XX/XXXX my wallet was stolen with all my... |
4706 | Credit card | American Express is reporting an account on my... |
21566 | Credit card | Have disputed the reporting of the status of a... |
13906 | Credit card | I have been the victim of identity theft fraud... |
16853 | Credit card | I have requested XXXX XXXX to run a credit rep... |
10505 | Credit card | I have been working since XXXX 2016 to get a i... |
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-22-9932ab8bdc5b> in <module>()
3 for predicted in category_id_df.category_id:
4 for actual in category_id_df.category_id:
----> 5 if predicted != actual and conf_mat[actual, predicted] >= 6:
6 print("'{}' predicted as '{}' : {} examples.".format(id_to_category[actual], id_to_category[predicted], conf_mat[actual, predicted]))
7 display(df.loc[indices_test[(y_test == actual) & (y_pred == predicted)]][['Product', 'Consumer_complaint_narrative']])
IndexError: index 11 is out of bounds for axis 0 with size 11
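The IndexError above is consistent with the confusion matrix having only 11 rows and columns (the classes that actually occur in the test split) while the loop runs over all 12 category ids. This is an inference, since the cell that builds `conf_mat` is not shown. Below is a pure-Python sketch of one way around it: build the matrix against an explicit, complete label list. The helper name and toy data are made up; scikit-learn's `confusion_matrix` takes a `labels` argument that serves the same purpose.

```python
def confusion_matrix_fixed(y_true, y_pred, labels):
    """Hypothetical helper: rows are actual classes, columns predicted, in `labels` order."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


all_ids = [0, 1, 2, 3]        # every category id, including rare ones
y_test = [0, 0, 1, 2, 2, 1]   # id 3 never appears in this toy test split
y_pred = [0, 1, 1, 2, 0, 1]

conf_mat = confusion_matrix_fixed(y_test, y_pred, all_ids)

# The loop can now safely visit every id, because the matrix is 4 x 4
for predicted in all_ids:
    for actual in all_ids:
        if predicted != actual and conf_mat[actual][predicted] >= 1:
            print(f"{actual} predicted as {predicted}: "
                  f"{conf_mat[actual][predicted]} examples")
```

The row and column for the missing class are simply all zeros, so the threshold check skips them instead of raising.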
model.fit(features, labels)
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
verbose=0)
from sklearn.feature_selection import chi2
# 'Bank account or service':
. Top unigrams:
. bank
. account
. Top bigrams:
. debit card
. overdraft fees
# 'Consumer Loan':
. Top unigrams:
. vehicle
. car
. Top bigrams:
. personal loan
. history xxxx
# 'Credit card':
. Top unigrams:
. card
. discover
. Top bigrams:
. credit card
. discover card
# 'Credit reporting':
. Top unigrams:
. equifax
. transunion
. Top bigrams:
. xxxx account
. trans union
# 'Debt collection':
. Top unigrams:
. debt
. collection
. Top bigrams:
. account credit
. time provided
# 'Money transfers':
. Top unigrams:
. paypal
. transfer
. Top bigrams:
. money transfer
. send money
# 'Mortgage':
. Top unigrams:
. mortgage
. escrow
. Top bigrams:
. loan modification
. mortgage company
# 'Other financial service':
. Top unigrams:
. passport
. dental
. Top bigrams:
. stated pay
. help pay
# 'Payday loan':
. Top unigrams:
. payday
. loan
. Top bigrams:
. payday loan
. pay day
# 'Prepaid card':
. Top unigrams:
. prepaid
. serve
. Top bigrams:
. prepaid card
. use card
# 'Student loan':
. Top unigrams:
. navient
. loans
. Top bigrams:
. student loan
. sallie mae
# 'Virtual currency':
. Top unigrams:
. https
. tx
. Top bigrams:
. money want
. xxxx provider
texts = ["I requested a home loan modification through Bank of America. Bank of America never got back to me.",
"I requested a home loan modification through Bank of America. Bank of America never got back to me."
- Predicted as: 'Mortgage'
"It has been difficult for me to find my past due balance. I missed a regular monthly payment"
- Predicted as: 'Credit reporting'
"I can't get the money out of the country."
- Predicted as: 'Bank account or service'
"I have no money to pay my tuition"
- Predicted as: 'Debt collection'
"Coinbase closed my account for no reason and furthermore refused to give me a reason despite dozens of request"
- Predicted as: 'Bank account or service'
from sklearn import metrics
class | precision | recall | f1-score | support |
---|---|---|---|---|
Credit reporting | 0.82 | 0.82 | 0.82 | 288 |
Consumer Loan | 0.83 | 0.60 | 0.70 | 100 |
Debt collection | 0.80 | 0.91 | 0.85 | 359 |
Mortgage | 0.90 | 0.93 | 0.92 | 317 |
Credit card | 0.73 | 0.77 | 0.75 | 165 |
Other financial service | 0.00 | 0.00 | 0.00 | 1 |
Bank account or service | 0.74 | 0.74 | 0.74 | 121 |
Student loan | 0.92 | 0.83 | 0.87 | 111 |
Money transfers | 0.50 | 0.23 | 0.32 | 13 |
Payday loan | 0.75 | 0.38 | 0.50 | 16 |
Prepaid card | 0.67 | 0.12 | 0.20 | 17 |
avg / total | 0.82 | 0.82 | 0.81 | 1508 |
/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sklearn/metrics/classification.py:1428: UserWarning: labels size, 11, does not match size of target_names, 12
.format(len(labels), len(target_names))
/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/sklearn/metrics/classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
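The two warnings point at separate issues. The first says 11 labels were found in the data while 12 target names were passed; supplying the same explicit label list through `classification_report`'s `labels` and `target_names` arguments is one way to keep names and rows aligned. The second is raised when a class is never predicted, so its precision has a zero denominator and is reported as 0.0; the 'Other financial service' row appears to be such a case. A small pure-Python sketch of that convention, with a made-up helper name and toy data:

```python
def per_class_precision(y_true, y_pred, labels):
    """Hypothetical helper: precision per label, 0.0 when a label is never predicted."""
    result = {}
    for label in labels:
        predicted = sum(p == label for p in y_pred)
        correct = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        # No predicted samples: precision is undefined, so fall back to 0.0
        result[label] = correct / predicted if predicted else 0.0
    return result


labels = ["Mortgage", "Credit card", "Other financial service"]
y_true = ["Mortgage", "Mortgage", "Credit card", "Other financial service"]
y_pred = ["Mortgage", "Credit card", "Credit card", "Mortgage"]

precision = per_class_precision(y_true, y_pred, labels)
print(precision["Other financial service"])  # 0.0
```

A 0.0 produced this way says nothing about how good the predictions for that class are, only that there were none, which is worth keeping in mind when reading rows with very small support.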