
So Much Data, So Many Formats: A Conversion Service, Part 2

In this post, we cover how to create the convert controller, a pipeline for our data, and how to store our data. Let’s get started!

Sep. 28, 2018 · Big Data Zone


The Convert Controller

Now that we have explained the way the data is represented, we can see how it enters our program by looking at the main controller.

Our main controller is the ConvertController. It contains the core dispatching logic for our API.

[Route("api/[controller]")] public class ConvertController : Controller {             private IDataRepository _dataRepository;     public ConvertController(IDataRepository dataRepository)     {         _dataRepository = dataRepository;     }      [HttpGet]     public IActionResult Get()     {         IEnumerable<InfoData> files =_dataRepository.All();          return View(files);     }      [HttpGet("{id}/{format?}", Name = "GetData")]     public IActionResult Get(Guid id, string format = "")     {                                 InfoData data = _dataRepository.Get(id);         Pipeline pipeline = new Pipeline();                      if(!String.IsNullOrEmpty(format))             data.Format = Enum.Parse<Format>(format);          string converted = Pipeline.Convert(data.Data, data.Format);          if(String.IsNullOrEmpty(converted))              return NoContent();         else             return Content(converted);     }      [HttpPost]     public IActionResult Post(InputData value)     {                     InfoData result = _dataRepository.Save(value);          return new JsonResult(result);     }      [HttpDelete("{id}")]     public IActionResult Delete(Guid id)     {         _dataRepository.Delete(id);          return Ok();     } }

The whole controller is simple and intuitive. We use dependency injection to pass an IDataRepository object to the controller. Dependency injection is provided by ASP.NET Core, and we are going to set it up later.
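The IDataRepository interface itself lives in the project repository and is not reproduced in this article; a minimal sketch consistent with the calls the controller makes might look like this (the member names come from the code above, the rest is assumed):

// Hypothetical sketch of IDataRepository, inferred from the controller above;
// the actual definition is in the project repository.
using System;
using System.Collections.Generic;

public interface IDataRepository
{
    IEnumerable<InfoData> All();      // list all uploaded files
    InfoData Get(Guid id);            // load the data for a given id
    InfoData Save(InputData value);   // store an uploaded file
    void Delete(Guid id);             // remove a stored file
}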

The only interesting part is the Get method, where we call our pipeline to convert the file. First, we get the data corresponding to the id from the repository; at this point, the field data.Format contains the original format of the file. If the user provided a value for format, we change data.Format to the specified format.

Whether the format is the original one or a different one, we always pass the data and the target format to the pipeline. This way, all data passes through our pipeline. So, for instance, we could return the data in the same format, but after having performed some operation on it (e.g., ensuring the use of a standard decimal separator).
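To make the routing concrete, here is a hypothetical way to call the endpoint from C#. The host name is a placeholder and the id is a stand-in for the Id returned when the file was uploaded; the routes themselves follow from the attributes on ConvertController:

// Hypothetical client calls (inside a modern top-level program or an async method);
// the base address and the id are placeholders, not values from the article.
using System;
using System.Net.Http;

var client = new HttpClient { BaseAddress = new Uri("http://localhost:5000/") };
var id = Guid.NewGuid(); // stand-in for the Id returned by the POST that uploaded the file

// Original format, but still passed through the pipeline.
string original = await client.GetStringAsync($"api/convert/{id}");

// Same data, converted to JSON by the pipeline.
string asJson = await client.GetStringAsync($"api/convert/{id}/JSON");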

So, let’s see how the pipeline works.

A Very Simple Pipeline

In this tutorial, we are going to create a very simple pipeline. Our pipeline will:

  • receive data in our internal representation.
  • execute all operations specified in the pipeline.
  • return the data converted into the requested format (if the format can handle the data).

The idea of this tutorial is to create a generic system that can handle converting files between different formats. In order to do that well, we need a pipeline that can perform simple cleaning operations on the data, like merging values, ordering them, etc.

Discussing the Design

If we were willing to modify the code each time, we could simply create an interface and use delegates to perform custom operations. Something like this:

List<Func<DataItem, DataItem>> operations;
operations = new List<Func<DataItem, DataItem>>();
operations.Add(new Func<DataItem, DataItem>(operation));

// later

foreach (var operation in operations)
{
    operation(data);
}

However, this would make the service a bit cumbersome to use, unless we always wanted to perform the same standard operations.

This would be the ideal scenario in which to use a DSL since we have a limited scope (manipulating data) that could be drastically improved with a tool that facilitates the few operations we need to do. However, this is outside the scope of this tutorial.

So, a good alternative solution is to include a way to automatically discover and perform operations defined by the user. There are several potential ways to do that: an expression interpreter, the full Roslyn compiler, or a scripting engine.

An expression interpreter is too simple for the things we want to do. The full Roslyn compiler would be the way to go if we wanted to give users the full power of the language. However, aside from the security risks, it would give users more freedom than they need and would require a bit of setup work each time.

So, we opt for the middle ground and use a scripting solution. In practice, we are going to use the scripting engine included in Roslyn, but we are going to set up everything and the user will just add their own scripts.

The Pipeline

We start with the simple Convert function that we have seen the ConvertController call.

public String Convert(DataItem data, Format format)
{
    IConverter<DataItem> converter = null;
    String converted = String.Empty;

    if (data != null)
    {
        data = PerformOperations(data);

        switch (format)
        {
            case Format.JSON:
                converter = new ParsingServices.Models.JsonConverter();
                break;
            case Format.CSV:
                converter = new ParsingServices.Models.CSVConverter();
                break;
            default:
                break;
        }

        if (converter.IsValid(data))
        {
            converted = converter.ToFile(data);
        }
    }

    return converted;
}

The function performs all the operations on the data and then, after having checked that the target format can support the data, converts the data into the requested format.
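The IConverter<DataItem> implementations come from the project repository and are not shown here; based only on the calls made in this article, a minimal sketch of the interface might be:

// Hypothetical sketch of IConverter<T>, inferred from how it is used in Convert and Save;
// the real interface is defined in the project repository.
using System.IO;

public interface IConverter<T>
{
    bool IsValid(T data);         // can the target format represent this data?
    string ToFile(T data);        // serialize the generic representation to the format
    T FromFile(FileStream file);  // parse an uploaded file into the generic representation
}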

The processing of the operations happens in the method PerformOperations, which might be simpler than you expect.

public DataItem PerformOperations(DataItem data)
{
    foreach (var file in Directory.EnumerateFiles($"{Location}{Path.DirectorySeparatorChar}"))
    {
        GlobalsScript item = new GlobalsScript { Data = data };

        var script = CSharpScript.EvaluateAsync<DataItem>(
            System.IO.File.ReadAllText(file),
            Microsoft.CodeAnalysis.Scripting.ScriptOptions.Default
                .WithReferences(typeof(ParsingServices.Models.DataItem).Assembly)
                // WithImports replaces the import list, so we pass all the namespaces in one call
                .WithImports("System.Collections.Generic", "System.Linq"),
            globalsType: item.GetType(), globals: item);

        script.Wait();
        data = script.Result;
    }

    return data;
}

The method collects all operations defined in files inside the proper location and then executes them one by one. The operations are specified inside files that can be uploaded, just like the files to be converted are uploaded. In the repository, there is an OperationsController and a couple of methods in the Pipeline class to manage the creation of operations, but we do not show them here because that code is elementary.

It all happens with the method EvaluateAsync. This method accepts code as a string, together with an object (globals) that contains the data the script can access. We also have to specify the assemblies required by the code. This is the critical step that could make our scripting solution fragile: since it is only here, and not inside the scripts, that we can set up assemblies, we have to make sure to include all the assemblies that we will need. This way each script has everything it needs.

We can also add using statements inside each script, but it is handy to set up here (with the WithImports method) the namespaces that we will always need.

We cannot use all C# code inside the scripts, but we can do a fair bit. The following is an example script, which is also included in the repository.

using ParsingServices.Models;

if (Data is DataArray)
{
    bool simpleValues = true;
    foreach (var v in (Data as DataArray).Values)
    {
        if (!(v is DataValue))
            simpleValues = false;
    }

    if (simpleValues)
    {
        (Data as DataArray).Values = (Data as DataArray).Values
            .OrderByDescending(v => (v as DataValue).Text)
            .ToList();
    }
}

return Data;

The script orders the values if the item is an array and all the values are simple values (e.g., 5 or hello).

The argument we pass as globals in EvaluateAsync is accessible directly, i.e., we use Data and not globals.Data. A nice thing is that we do not need to wrap the code in a class or method; it is just a sequence of statements.
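For reference, the globals type only needs to expose the Data property that the scripts read and write; a minimal sketch (assumed, not taken from the repository) would be:

// Hypothetical sketch of the globals type passed to EvaluateAsync;
// scripts access the Data property directly (just "Data", not "globals.Data").
public class GlobalsScript
{
    public DataItem Data { get; set; }
}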

Storing the Data

Now let’s look at the DataRepository class. This class stores the files that are uploaded to our service. Obviously, there is no need to store a file if we just want to convert it once. However, if it makes sense to create a conversion service at all, it is probably useful to automatically serve the converted file when needed. To provide such a feature, the simplest option is to save the file: we upload it one time and then request it as needed.

We do not save data in a database but in a directory. Many databases do support storing files, but the simplest approach is enough in this case. We are only going to look at the Save method, since the rest does not show anything challenging. As always, you can see the whole file in the repository.

public InfoData Save(InputData value)
{
    var id = Guid.NewGuid();
    Directory.CreateDirectory($"{Location}{Path.DirectorySeparatorChar}{id}{Path.DirectorySeparatorChar}");

    IConverter<DataItem> converter = null;
    DataItem data = new DataItem();

    switch (value.Format)
    {
        case Format.JSON:
            converter = new ParsingServices.Models.JsonConverter();
            break;
        case Format.CSV:
            converter = new ParsingServices.Models.CSVConverter();
            break;
        default:
            break;
    }

    using (FileStream fs = new FileStream($"{Location}{Path.DirectorySeparatorChar}{id}{Path.DirectorySeparatorChar}file.{value.Format}", FileMode.OpenOrCreate))
    {
        value.File.CopyTo(fs);
    }

    using (FileStream fs = new FileStream($"{Location}{Path.DirectorySeparatorChar}{id}{Path.DirectorySeparatorChar}file.{value.Format}", FileMode.Open))
    {
        data = converter.FromFile(fs);
    }

    var infoData = new InfoData()
    {
        Id = id,
        Format = value.Format,
        LocationFile = $"{Location}{Path.DirectorySeparatorChar}{id}{Path.DirectorySeparatorChar}file.{value.Format}",
        Data = data
    };

    JsonSerializerSettings settings = new JsonSerializerSettings();
    settings.ReferenceLoopHandling = ReferenceLoopHandling.Ignore;
    settings.Formatting = Formatting.Indented;

    System.IO.File.WriteAllText(
        $"{Location}{Path.DirectorySeparatorChar}{id}{Path.DirectorySeparatorChar}data.json",
        JsonConvert.SerializeObject(infoData, settings));

    return infoData;
}

We create a new directory corresponding to a new id inside Location (a field of the class whose declaration is not shown). After having created the proper *Converter for the requested format, we both copy the uploaded file inside our directory and create the generic data format from the file itself. Finally, we save the InfoData object inside a data.json file that sits next to the uploaded file.

We are not actually going to use the value stored in the Data field that is saved in the data.json file. Instead, when we are asked to load data for a specific id, we simply use the proper *Converter again to recreate the data directly. We store it here for debugging purposes, in case we want to check how a problematic file is converted.
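A hedged sketch of what that loading could look like, assuming the same directory layout used by Save (the real Get method lives in the repository and may differ):

// Hypothetical sketch of DataRepository.Get; the actual implementation is in the repository.
public InfoData Get(Guid id)
{
    string dir = $"{Location}{Path.DirectorySeparatorChar}{id}{Path.DirectorySeparatorChar}";

    // Read back the metadata saved by Save(), ignoring the stored Data field.
    var infoData = JsonConvert.DeserializeObject<InfoData>(
        System.IO.File.ReadAllText($"{dir}data.json"));

    IConverter<DataItem> converter;
    switch (infoData.Format)
    {
        case Format.JSON:
            converter = new ParsingServices.Models.JsonConverter();
            break;
        default:
            converter = new ParsingServices.Models.CSVConverter();
            break;
    }

    // Recreate the generic representation directly from the original file.
    using (FileStream fs = new FileStream(infoData.LocationFile, FileMode.Open))
    {
        infoData.Data = converter.FromFile(fs);
    }

    return infoData;
}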

To activate dependency injection for the DataRepository class, we just add a line in the method ConfigureServices of the Startup.cs file.

public void ConfigureServices(IServiceCollection services)
{
    services.AddMvc();

    // we tell ASP.NET that DataRepository implements IDataRepository
    services.AddTransient<IDataRepository, DataRepository>();
}

That’s all for Part 2! Tune in Sunday, when we will cover converting data to the CSV and JSON formats.


Topics:
big data, data pipeline, data storage, data conversion, tutorial
