1 ML.NET samples overview
https://github.com/feiyun0112/machinelearning-samples.zh-cn/tree/master
https://gitee.com/mirrors_feiyun0112/machinelearning-samples.zh-cn
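The official ML.NET samples are grouped into categories by scenario and machine learning problem/task, as listed in the table below:
Category | Sample | Languages
Binary classification | Sentiment analysis | C#, F#
Binary classification | Spam detection | C#, F#
Binary classification | Credit card fraud detection (Binary Classification) | C#, F#
Binary classification | Heart disease prediction | C#
Multiclass classification | GitHub Issues classification | C#, F#
Multiclass classification | Iris flower classification | C#, F#
Multiclass classification | Handwritten digit recognition | C#
Recommendation | Product recommendation | C#
Recommendation | Movie recommendation (Matrix Factorization) | C#
Recommendation | Movie recommendation (Field Aware Factorization Machines) | C#
Regression | Price prediction | C#, F#
Regression | Sales forecasting | C#
Regression | Demand prediction | C#, F#
Time series forecasting | Sales forecasting | C#
Anomaly detection | Sales spike detection | C# (2 samples)
Anomaly detection | Power anomaly detection | C#
Anomaly detection | Credit card fraud detection (Anomaly Detection) | C#
Clustering | Customer segmentation | C#, F#
Clustering | Iris flower clustering | C#, F#
Ranking | Ranking search engine results | C#
Computer vision | Image classification training (High-Level API) | C#, F#
Computer vision | Image classification prediction (Pretrained TensorFlow model scoring) | C# (2 samples), F#
Computer vision | Image classification training (TensorFlow Featurizer Estimator) | C#, F#
Computer vision | Object detection (ONNX model scoring) | C# (2 samples)
Cross-cutting scenarios | Scalable model on Web API | C#
Cross-cutting scenarios | Scalable model on a Razor web app | C#
Cross-cutting scenarios | Scalable model on Azure Functions | C#
Cross-cutting scenarios | Scalable model on a Blazor web app | C#
Cross-cutting scenarios | Large datasets | C#
Cross-cutting scenarios | Loading data with DatabaseLoader | C#
Cross-cutting scenarios | Loading data with LoadFromEnumerable | C#
Cross-cutting scenarios | Model explainability | C#
Cross-cutting scenarios | Export to ONNX | C#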
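The project directories are organized by how ML.NET is used: the CLI, modelbuilder, csharp and similar folders correspond to automated learning from the ML.NET command line, GUI-based learning (Model Builder), and API-based learning, respectively.
Visual Studio offers to install any missing components automatically.
To avoid errors caused by overly long Windows paths, change the following setting:
Local Group Policy Editor
Computer Configuration / Administrative Templates / System / Filesystem (the policy to enable there is "Enable Win32 long paths")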
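This problem is about predicting whether a customer's comment carries positive or negative sentiment. We use the small wikipedia-detox datasets (one for training and one for evaluating the model's accuracy). They were processed by humans, and each comment has been assigned a sentiment label.
We use these datasets to build a model that, at prediction time, analyzes a string and predicts a sentiment value of 0 or 1.
The training data looks like this: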
Label rev_id comment year logged_in ns sample split
0 666674821.0 “ He is a Rapist!!!!! Please edit the article to include this
important fact. Thank You. — Preceding unsigned comment added by • ” 2015 True
article blocked train
0 24297552.0 The other two films Hitch and Magnolia are also directly related
to the community in question, and may be of interest to those who see those
films. So why not link to them? 2005 False article random train
public class SentimentIssue
{
[LoadColumn(0)]
public bool Label { get; set; }
[LoadColumn(2)]
public string Text { get; set; }
}
Only columns 0 and 2 of the data are loaded.
IDataView dataView = mlContext.Data.LoadFromTextFile<SentimentIssue>(DataPath, hasHeader: true);
TrainTestData trainTestSplit = mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);
IDataView trainingData = trainTestSplit.TrainSet;
IDataView testData = trainTestSplit.TestSet;
The data is loaded using the input type defined above and split 80/20 (testFraction: 0.2) into a training set and a test set.
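As a rough illustration of what an 80/20 split does, the sketch below shuffles the rows and cuts them by a test fraction. It is plain C# and is not how ML.NET implements TrainTestSplit; the names SplitSketch and Split and the seed parameter are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitSketch
{
    // Shuffles the rows with a seeded RNG, then puts the first testFraction
    // of the shuffled rows into the test set and the rest into the training set.
    public static (List<T> Train, List<T> Test) Split<T>(
        IReadOnlyList<T> rows, double testFraction, int seed = 0)
    {
        var rng = new Random(seed);
        var shuffled = rows.OrderBy(_ => rng.Next()).ToList();
        int testCount = (int)Math.Round(rows.Count * testFraction);
        return (shuffled.Skip(testCount).ToList(), shuffled.Take(testCount).ToList());
    }
}
```

With 10 rows and a test fraction of 0.2, this yields 8 training rows and 2 test rows, and every row lands in exactly one of the two sets.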
// STEP 2: Common data process configuration with pipeline data transformations
var dataProcessPipeline = mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: nameof(SentimentIssue.Text));
// STEP 3: Set the training algorithm, then create and config the modelBuilder
var trainer = mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(labelColumnName: "Label", featureColumnName: "Features");
var trainingPipeline = dataProcessPipeline.Append(trainer);
// STEP 4: Train the model fitting to the DataSet
ITransformer trainedModel = trainingPipeline.Fit(trainingData);
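The dataProcessPipeline variable is a data processing pipeline created by calling mlContext.Transforms.Text.FeaturizeText. Two parameters are passed here, outputColumnName and inputColumnName.
outputColumnName is the name of the output feature column, set here to "Features".
inputColumnName is the name of the input text column. nameof(SentimentIssue.Text) is a C# expression that evaluates to the member's name as a string, i.e. "Text".
SdcaLogisticRegression creates a binary classification trainer, with the label column set to "Label" and the feature column set to "Features". This trainer fits a logistic regression model using the SDCA (stochastic dual coordinate ascent) algorithm.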
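To make the nameof point concrete, here is a small self-contained sketch. The SentimentIssue class is repeated without the ML.NET attributes so that it compiles on its own; NameofDemo and TextColumnName are hypothetical names used only for this illustration.

```csharp
using System;

public class SentimentIssue
{
    public bool Label { get; set; }
    public string Text { get; set; } = string.Empty;
}

public static class NameofDemo
{
    // nameof is resolved at compile time and yields the simple member name,
    // so renaming the property also updates the column name used in the pipeline.
    public static string TextColumnName() => nameof(SentimentIssue.Text);
}
```

Calling NameofDemo.TextColumnName() returns the string "Text", which is the column name the pipeline above passes as inputColumnName.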
// STEP 5: Evaluate the model and show accuracy stats
var predictions = trainedModel.Transform(testData);
var metrics = mlContext.BinaryClassification.Evaluate(data: predictions, labelColumnName: "Label", scoreColumnName: "Score");
ConsoleHelper.PrintBinaryClassificationMetrics(trainer.ToString(), metrics);
// STEP 6: Save/persist the trained model to a .ZIP file
mlContext.Model.Save(trainedModel, trainingData.Schema, ModelPath);
Console.WriteLine("The model is saved to {0}", ModelPath);
The model is evaluated on the test data and then saved.
The dataset contains nearly 6,000 messages, each labeled either "spam" or "ham" (not spam).
The downloaded data looks like this:
ham Go until jurong point, crazy.. Available only in bugis n great world la e
buffet… Cine there got amore wat…
ham Ok lar… Joking wif u oni…
spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text
FA to 87121 to receive entry question(std txt rate)T&C’s apply
08452810075over18’s
ham U dun say so early hor… U c already then say…
class SpamInput
{
[LoadColumn(0)]
public string Label { get; set; }
[LoadColumn(1)]
public string Message { get; set; }
}
The class maps to the two columns in the data, both of type string.
// Create the estimator which converts the text label to boolean, featurizes the text, and adds a linear trainer.
// Data process configuration with pipeline data transformations
var dataProcessPipeline = mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")
.Append(mlContext.Transforms.Text.FeaturizeText("FeaturesText", new Microsoft.ML.Transforms.Text.TextFeaturizingEstimator.Options
{
WordFeatureExtractor = new Microsoft.ML.Transforms.Text.WordBagEstimator.Options { NgramLength = 2, UseAllLengths = true },
CharFeatureExtractor = new Microsoft.ML.Transforms.Text.WordBagEstimator.Options { NgramLength = 3, UseAllLengths = false },
Norm = Microsoft.ML.Transforms.Text.TextFeaturizingEstimator.NormFunction.L2,
}, "Message"))
.Append(mlContext.Transforms.CopyColumns("Features", "FeaturesText"))
.AppendCacheCheckpoint(mlContext);
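Machine learning algorithms operate on numbers, so the text has to be converted first. The pipeline does the following:
Maps the values of the "Label" column to a key type (MapValueToKey).
Featurizes the "Message" column, turning the text into a feature vector configured through WordBagEstimator options. NgramLength is the n-gram length. UseAllLengths controls whether all n-gram lengths up to NgramLength are used or only NgramLength itself. Norm is how the feature vector is normalized (L2 here).
Copies the "FeaturesText" column to the "Features" column.
Adds a cache checkpoint to the pipeline. Data processing can be expensive during training. A cache checkpoint keeps the processed data in memory so that a trainer that makes several passes over the data does not reprocess it on every pass, which can shorten training time noticeably.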
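The effect of NgramLength, UseAllLengths and L2 normalization can be sketched in plain C#. This is a simplified illustration on whitespace-separated words, not ML.NET's actual featurizer; NgramSketch and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NgramSketch
{
    // Returns word n-grams of length 1..ngramLength when useAllLengths is true,
    // otherwise only n-grams of exactly ngramLength.
    public static List<string> WordNgrams(string text, int ngramLength, bool useAllLengths)
    {
        string[] tokens = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        var result = new List<string>();
        int minLength = useAllLengths ? 1 : ngramLength;
        for (int n = minLength; n <= ngramLength; n++)
            for (int i = 0; i + n <= tokens.Length; i++)
                result.Add(string.Join("|", tokens, i, n));
        return result;
    }

    // Counts each n-gram and scales the counts so the vector has Euclidean length 1.
    public static Dictionary<string, double> L2NormalizedCounts(IEnumerable<string> ngrams)
    {
        var counts = ngrams.GroupBy(g => g).ToDictionary(g => g.Key, g => (double)g.Count());
        double norm = Math.Sqrt(counts.Values.Sum(v => v * v));
        return counts.ToDictionary(kv => kv.Key, kv => norm == 0 ? 0.0 : kv.Value / norm);
    }
}
```

For the text "free entry now" with ngramLength 2, useAllLengths true gives five n-grams (three single words and two word pairs), while useAllLengths false gives only the two pairs.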
// Set the training algorithm
var trainer = mlContext.MulticlassClassification.Trainers.OneVersusAll(mlContext.BinaryClassification.Trainers.AveragedPerceptron(labelColumnName: "Label", numberOfIterations: 10, featureColumnName: "Features"), labelColumnName: "Label")
.Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel", "PredictedLabel"));
var trainingPipeLine = dataProcessPipeline.Append(trainer);
// Evaluate the model using cross-validation.
// Cross-validation splits our dataset into 'folds', trains a model on some folds and
// evaluates it on the remaining fold. We are using 5 folds so we get back 5 sets of scores.
// Let's compute the average AUC, which should be between 0.5 and 1 (higher is better).
Console.WriteLine("=============== Cross-validating to get model's accuracy metrics ===============");
var crossValidationResults = mlContext.MulticlassClassification.CrossValidate(data: data, estimator: trainingPipeLine, numberOfFolds: 5);
ConsoleHelper.PrintMulticlassClassificationFoldsAverageMetrics(trainer.ToString(), crossValidationResults);
// Now let's train a model on the full dataset to help us get better results
var model = trainingPipeLine.Fit(data);
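The fold logic described in the comments above can be sketched in plain C#. This only shows how row indices might be divided into 5 train/test pairs; it is not ML.NET's CrossValidate implementation, and the round-robin assignment as well as the names FoldSketch and Folds are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FoldSketch
{
    // For each fold, rows whose index modulo numberOfFolds equals the fold number
    // are held out for evaluation; all other rows are used for training.
    public static IEnumerable<(List<int> Train, List<int> Test)> Folds(int rowCount, int numberOfFolds)
    {
        for (int fold = 0; fold < numberOfFolds; fold++)
        {
            var train = new List<int>();
            var test = new List<int>();
            for (int row = 0; row < rowCount; row++)
                (row % numberOfFolds == fold ? test : train).Add(row);
            yield return (train, test);
        }
    }
}
```

With 10 rows and 5 folds, each fold trains on 8 rows and evaluates on the remaining 2, and every row is held out exactly once across the folds.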
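ProductRecommender.csproj has to be added to the solution manually.
The project recommends products that tend to be purchased together, based on purchase history.
The anonymized data looks like this: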
ProductID ProductID_Copurchased
0 1
0 2
0 3
0 4
0 5
1 0
1 2
1 4
1 5
1 15
public class ProductEntry
{
[KeyType(count : 262111)]
public uint ProductID { get; set; }
[KeyType(count : 262111)]
public uint CoPurchaseProductID { get; set; }
}
The training data contains 262,111 distinct products, which is why the key count is 262111.
//STEP 2: Read the trained data using TextLoader by defining the schema for reading the product co-purchase dataset
// Do remember to replace amazon0302.txt with dataset from https://snap.stanford.edu/data/amazon0302.html
var traindata = mlContext.Data.LoadFromTextFile(path:TrainingDataLocation,
columns: new[]
{
new TextLoader.Column("Label", DataKind.Single, 0),
new TextLoader.Column(name:nameof(ProductEntry.ProductID), dataKind:DataKind.UInt32, source: new [] { new TextLoader.Range(0) }, keyCount: new KeyCount(262111)),
new TextLoader.Column(name:nameof(ProductEntry.CoPurchaseProductID), dataKind:DataKind.UInt32, source: new [] { new TextLoader.Range(1) }, keyCount: new KeyCount(262111))
},
hasHeader: true,
separatorChar: '\t');
//STEP 3: Your data is already encoded so all you need to do is specify options for MatrixFactorizationTrainer with a few extra hyperparameters
// LossFunction, Alpha, Lambda and a few others like K and C as shown below and call the trainer.
MatrixFactorizationTrainer.Options options = new MatrixFactorizationTrainer.Options();
options.MatrixColumnIndexColumnName = nameof(ProductEntry.ProductID);
options.MatrixRowIndexColumnName = nameof(ProductEntry.CoPurchaseProductID);
options.LabelColumnName= "Label";
options.LossFunction = MatrixFactorizationTrainer.LossFunctionType.SquareLossOneClass;
options.Alpha = 0.01;
options.Lambda = 0.025;
// For better results use the following parameters
//options.K = 100;
//options.C = 0.00001;
//Step 4: Call the MatrixFactorization trainer by passing options.
var est = mlContext.Recommendation().Trainers.MatrixFactorization(options);
//STEP 5: Train the model fitting to the DataSet
//Please add Amazon0302.txt dataset from https://snap.stanford.edu/data/amazon0302.html to Data folder if FileNotFoundException is thrown.
ITransformer model = est.Fit(traindata);
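The Matrix Factorization trainer is used to train recommender systems. It works on user behavior data and learns the latent relationships between user interests and items. It applies to many recommendation scenarios, such as movie or product recommendation.
The trainer represents the behavior data as a matrix and factorizes that matrix to uncover these latent relationships. Trainers of this kind are typically fit with an optimizer such as stochastic gradient descent (SGD) and use regularization to limit overfitting.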
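The idea of factorizing a sparse matrix with SGD and regularization can be sketched in a few lines of plain C#. This is a textbook-style illustration with squared loss, not the algorithm ML.NET's MatrixFactorizationTrainer actually runs; the class name, parameter names and default hyperparameters are all assumptions.

```csharp
using System;

public static class MatrixFactorizationSketch
{
    // Learns rank-sized factor vectors P[row] and Q[col] so that Dot(P[row], Q[col])
    // approximates each observed value, using SGD with L2 regularization (lambda).
    public static (double[][] P, double[][] Q) Train(
        (int Row, int Col, double Value)[] observed, int rows, int cols,
        int rank = 4, int epochs = 500, double learningRate = 0.05,
        double lambda = 0.01, int seed = 1)
    {
        var rng = new Random(seed);
        double[][] P = Init(rows, rank, rng);
        double[][] Q = Init(cols, rank, rng);
        for (int epoch = 0; epoch < epochs; epoch++)
        {
            foreach (var (r, c, value) in observed)
            {
                double error = value - Dot(P[r], Q[c]);
                for (int f = 0; f < rank; f++)
                {
                    double p = P[r][f], q = Q[c][f];
                    // Gradient step on the squared error plus the regularization term.
                    P[r][f] += learningRate * (error * q - lambda * p);
                    Q[c][f] += learningRate * (error * p - lambda * q);
                }
            }
        }
        return (P, Q);
    }

    public static double Dot(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }

    // Small random starting values so that the factors are not all identical.
    private static double[][] Init(int count, int rank, Random rng)
    {
        var m = new double[count][];
        for (int i = 0; i < count; i++)
        {
            m[i] = new double[rank];
            for (int f = 0; f < rank; f++) m[i][f] = rng.NextDouble() * 0.1;
        }
        return m;
    }
}
```

After training, Dot(P[r], Q[c]) can be evaluated for any row/column pair, including pairs that were never observed, which is what makes the factors usable for recommendation.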
The original project code was modified slightly to score a few more product pairs:
var pes = new ProductEntry[]
{
new ProductEntry() {
ProductID = 3,
CoPurchaseProductID = 63
},
new ProductEntry() {
ProductID = 3,
CoPurchaseProductID = 20
},
new ProductEntry() {
ProductID = 262108,
CoPurchaseProductID = 262109
},
new ProductEntry() {
ProductID = 262108,
CoPurchaseProductID = 3
}
};
// Assumption: the output type is the sample's Copurchase_prediction class, which exposes a Score property.
var predictionengine = mlContext.Model.CreatePredictionEngine<ProductEntry, Copurchase_prediction>(model);
foreach (var p in pes)
{
var prediction = predictionengine.Predict(p);
Console.WriteLine($"For ProductID = {p.ProductID} and CoPurchaseProductID = {p.CoPurchaseProductID}");
Console.WriteLine(" the predicted score is " + Math.Round(prediction.Score, 1));
}
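The output shows that this algorithm's predictions can be noticeably off.
In machine learning, different algorithms can produce very different results.
The movie data contains ratings, titles, genres, and similar information: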
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
recommendation-ratings-train.csv
userId,movieId,rating,timestamp
1,1,4,964982703
1,3,4,964981247
1,6,4,964982224
1,47,5,964983815
public class MovieRating
{
[LoadColumn(0)]
public float userId;
[LoadColumn(1)]
public float movieId;
[LoadColumn(2)]
public float Label;
}
//STEP 3: Transform your data by encoding the two features userId and movieID. These encoded features will be provided as input
// to our MatrixFactorizationTrainer.
var dataProcessingPipeline = mlcontext.Transforms.Conversion.MapValueToKey(outputColumnName: "userIdEncoded", inputColumnName: nameof(MovieRating.userId))
.Append(mlcontext.Transforms.Conversion.MapValueToKey(outputColumnName: "movieIdEncoded", inputColumnName: nameof(MovieRating.movieId)));
//Specify the options for MatrixFactorization trainer
MatrixFactorizationTrainer.Options options = new MatrixFactorizationTrainer.Options();
options.MatrixColumnIndexColumnName = "userIdEncoded";
options.MatrixRowIndexColumnName = "movieIdEncoded";
options.LabelColumnName = "Label";
options.NumberOfIterations = 20;
options.ApproximationRank = 100;
//STEP 4: Create the training pipeline
var trainingPipeLine = dataProcessingPipeline.Append(mlcontext.Recommendation().Trainers.MatrixFactorization(options));
//STEP 5: Train the model fitting to the DataSet
Console.WriteLine("=============== Training the model ===============");
ITransformer model = trainingPipeLine.Fit(trainingDataView);
This example keeps training and consumption in separate projects:
samples\csharp\end-to-end-apps\Recommendation-MovieRecommender\MovieRecommender_Model is the model training project
samples\csharp\end-to-end-apps\Recommendation-MovieRecommender\MovieRecommender is the ASP.NET Core application
userId,movieId,rating,timestamp
1,1,4,964982703
1,3,4,964981247
1,6,4,964982224
1,47,5,964983815
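The raw data is prepared by the function below, which converts the ratings to binary labels and splits the data 9:1 into training and test sets: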
/*
* FieldAwareFactorizationMachine the learner used in this example requires the problem to setup as a binary classification problem.
* The DataPrep method performs two tasks:
* 1. It goes through all the ratings and replaces the ratings > 3 as 1, suggesting a movie is recommended and ratings < 3 as 0, suggesting a movie is not recommended
* 2. This piece of code also splits the ratings.csv into rating-train.csv and ratings-test.csv used for model training and testing respectively.
*/
public static void DataPrep()
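The body of DataPrep is not shown here. The two tasks listed in its comment could be sketched as follows, working on in-memory lines rather than files. This is an illustration under assumptions, not the sample's code: a rating of exactly 3 is treated as "not recommended", the first 90% of the rows go to the training set, and the names DataPrepSketch and Prepare are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class DataPrepSketch
{
    // Takes CSV lines of the form userId,movieId,rating,timestamp (with a header line),
    // replaces the rating with 1 (rating > 3) or 0 (assumption: everything else),
    // and splits the rows into a training part and a test part.
    public static (List<string> Train, List<string> Test) Prepare(
        IReadOnlyList<string> csvLines, double trainFraction = 0.9)
    {
        var rows = csvLines.Skip(1).Select(line =>
        {
            string[] parts = line.Split(',');
            float rating = float.Parse(parts[2], CultureInfo.InvariantCulture);
            string label = rating > 3 ? "1" : "0";
            return $"{parts[0]},{parts[1]},{label}";
        }).ToList();

        int trainCount = (int)Math.Round(rows.Count * trainFraction);
        return (rows.Take(trainCount).ToList(), rows.Skip(trainCount).ToList());
    }
}
```

For example, the row 1,1,4,964982703 becomes 1,1,1, and with 10 data rows the split is 9 training rows and 1 test row.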
public class MovieRating
{
[LoadColumn(0)]
public string userId;
[LoadColumn(1)]
public string movieId;
[LoadColumn(2)]
public bool Label;
}
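The class has the same shape as the one in the matrix factorization example, except that userId and movieId are strings and Label is a bool.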
// ML.NET doesn't cache data set by default. Therefore, if one reads a data set from a file and accesses it many times, it can be slow due to
// expensive featurization and disk operations. When the considered data can fit into memory, a solution is to cache the data in memory. Caching is especially
// helpful when working with iterative algorithms which needs many data passes. Since SDCA is the case, we cache. Inserting a
// cache step in a pipeline is also possible, please see the construction of pipeline below.
trainingDataView = mlContext.Data.Cache(trainingDataView);
Console.WriteLine("=============== Transform Data And Preview ===============", color);
Console.WriteLine();
//STEP 4: Transform your data by encoding the two features userId and movieID.
// These encoded features will be provided as input to FieldAwareFactorizationMachine learner
var dataProcessPipeline = mlContext.Transforms.Text.FeaturizeText(outputColumnName: "userIdFeaturized", inputColumnName: nameof(MovieRating.userId))
.Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName: "movieIdFeaturized", inputColumnName: nameof(MovieRating.movieId))
.Append(mlContext.Transforms.Concatenate("Features", "userIdFeaturized", "movieIdFeaturized")));
Common.ConsoleHelper.PeekDataViewInConsole(mlContext, trainingDataView, dataProcessPipeline, 10);
// STEP 5: Train the model fitting to the DataSet
Console.WriteLine("=============== Training the model ===============", color);
Console.WriteLine();
var trainingPipeLine = dataProcessPipeline.Append(mlContext.BinaryClassification.Trainers.FieldAwareFactorizationMachine(new string[] { "Features" }));
var model = trainingPipeLine.Fit(trainingDataView);
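Running this project trains the model and saves it to a file.
Then run the ASP.NET Core project.
The initial page only shows static data. Select a profile such as Ankit to see the following:
Click Recommend to see the following: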
MovieRatingPrediction prediction = null;
foreach (var movie in _movieService.GetTrendingMovies)
{
// Call the Rating Prediction for each movie prediction
prediction = _model.Predict(new MovieRating
{
userId = id.ToString(),
movieId = movie.MovieID.ToString()
});
// Normalize the prediction scores for the “ratings” b/w 0 - 100
float normalizedscore = Sigmoid(prediction.Score);
// Add the score for recommendation of each movie in the trending movie list
ratings.Add((movie.MovieID, normalizedscore));
}
A recommendation score is computed for each of the currently trending movies.
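The Sigmoid helper called above is not shown. A minimal sketch, assuming it is a logistic function scaled to the 0 to 100 range mentioned in the code comment (the scaling factor and the class name ScoreNormalizer are assumptions):

```csharp
using System;

public static class ScoreNormalizer
{
    // Maps a raw model score to the range (0, 100): large negative scores approach 0,
    // a score of 0 maps to 50, and large positive scores approach 100.
    public static float Sigmoid(float x)
    {
        return (float)(100 / (1 + Math.Exp(-x)));
    }
}
```

Because the function is monotonic, sorting movies by the normalized score gives the same order as sorting by the raw score. The scaling only makes the values easier to present.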