Posts Tagged Database

Random CSV Test Data Generator


Introduction

I have been building a piece of software which parallel loads CSV data, using the SqlBulkCopy class, into SQL Server tables. If you have ever loaded data into SQL Server, you will know that SqlBulkCopy is the fastest way to load it. This little project will be the subject of some subsequent blog posts.
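
For anyone who has not used it, a minimal SqlBulkCopy call looks something like the sketch below. This is an illustrative fragment only, not the loader itself; the destination table name is a placeholder, and the DataTable is assumed to be already populated.

using System.Data;
using System.Data.SqlClient;

internal static class BulkLoadSketch
{
    // Minimal sketch: push a populated DataTable into a target table.
    internal static void Load(DataTable table, string connectString)
    {
        using (var bulkCopy = new SqlBulkCopy(connectString))
        {
            // Placeholder table name; match it to your schema.
            bulkCopy.DestinationTableName = "dbo.MyTable";
            bulkCopy.WriteToServer(table);
        }
    }
}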

To test the parallel loading software I needed some “junk”, or random, data to load into the tables. I just needed a “bunch” of rows to load from a file, which would go into the table “clean”. I did not want data errors, at least in the initial testing of the program. I also wanted enough rows so that I could see that the parallel loading was really happening.

After a quick browse on the internet, nothing jumped out as being a quick way to generate data. So, I wrote this piece of software which generates some dummy data for me. I now present that software here for anyone to take and use (modify to your heart’s content).

Why load in Parallel?

If you have ever loaded bulk data into any relational database, you will know that it is a slow process. This slowness comes from a number of causes, but one way to combat it is to load multiple tables at the same time. Data files often come in different sizes, resulting in different tables taking different amounts of time to load, so having something which will load multiple tables at once is very handy. A routine which loads numerous tables, on different threads, at the same time means that you can significantly cut the elapsed time of the data loads, when compared to a serial process (believe me, it does; I have done it before).
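
To illustrate the idea (this is a sketch of the approach, not the actual loader, which will appear in later posts), one file per worker can be as simple as a Parallel.ForEach over the data directory. LoadCsvIntoTable here is a hypothetical method wrapping the CSV parse and the SqlBulkCopy call.

using System.IO;
using System.Threading.Tasks;

internal static class ParallelLoadSketch
{
    // Load every CSV in the directory, one file per worker thread.
    internal static void LoadAll(string dataDir, string connectString)
    {
        Parallel.ForEach(Directory.GetFiles(dataDir, "*.csv"), file =>
        {
            // Convention: the file name (minus extension) is the table name.
            string tableName = Path.GetFileNameWithoutExtension(file);
            LoadCsvIntoTable(file, tableName, connectString);
        });
    }

    // Hypothetical stub: parse the CSV and feed it to SqlBulkCopy.
    private static void LoadCsvIntoTable(string file, string tableName, string connectString)
    {
        // ... parse rows into a DataTable and bulk copy them here ...
    }
}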

Junk or Random Data

The “junk” data has some very special properties, which include:

  • Only numbers (integers), or strings. I did not want the problems of loading other data types (just yet).
  • The strings should be within the maximum length of the database field. Again, I don’t want data errors (just yet).
  • The strings will be simple. There will be no enclosing ‘”’ characters, and the embedded ‘,’ is out of scope as well. Again, I just wanted the simplest string parsing possible.
  • I did not care if the data made sense. This program makes really random junk text and numbers.
  • There is no referential integrity between the tables being loaded. That means no key from one table needs to exist in another table. Again, this removes the possibility of loads failing because of violated integrity constraints.
  • There are minimal integrity constraints on the table (see the example after this list). This is a twofold constraint:
    1. There are no keys or indexes on the tables.
    2. I just wanted data which would test the load process, not be slowed down by the database doing housekeeping on the way through.
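
The following is a hypothetical example of a table which satisfies all of the properties above; the generator itself just reads whatever tables already exist in the catalog. Only numeric and nvarchar columns, and no keys or indexes.

using System.Data.SqlClient;

internal static class SampleTableSketch
{
    // Create an example target table: numeric and nvarchar only, no constraints.
    internal static void CreateSampleTable(SqlConnection con)
    {
        using (var cmd = new SqlCommand(
            "CREATE TABLE Table1 (Col1 numeric, Col2 nvarchar(50), Col3 nvarchar(200))", con))
        {
            cmd.ExecuteNonQuery();
        }
    }
}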

The Main Program

This is a pretty simple main. There are a couple of points to note:

  • The creation of the SqlConnection is a bit of a “hangover” from years of writing code which used Oracle. In the Oracle environment connections cost a lot to create (relative to SQL Server), and are more flexible (you can have multiple fetches happening on the one connection – the SqlCommand is better isolated).
  • It just passes off to the Generator to do all of the heavy lifting.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data.SqlClient;
using System.Diagnostics;

namespace CSVDataGenerator
{
    /// <summary>
    /// Program class, which does some of the work
    /// </summary>
    class Program
    {
        /// <summary>
        /// Where the output files are written
        /// </summary>
        private const string outputDir =
@"C:\Users\Craig\Documents\Visual Studio 10\Projects\ParallelDataLoad\LoadData\Small";
        /// <summary>
        /// XML Control file which currently sets the number of output rows in the
        /// generated output files
        /// </summary>
        private const string controlFile =
@"C:\Users\Craig\Documents\Visual Studio 10\Projects\ParallelDataLoad\LoadData\Small\ControlFile.xml";
        /// <summary>
        /// Connect string to the db which contains the tables.
        /// </summary>
        private const string connectString =
"Data Source=CRAIG-PC-1;Initial Catalog=LoadContainer;Integrated Security=True;Pooling=False";
        /// <summary>
        /// Main method.
        /// </summary>
        /// <param name="args">Unused, at present</param>
        static void Main(string[] args)
        {
            var dataGenerator = new DataGenerator(outputDir, controlFile);
            using (SqlConnection con = new SqlConnection(connectString))
            {
                try
                {
                    con.Open();
                    dataGenerator.Generate(con);
                }
                // Naughty, but it is a programmer tool.
                // I just want to know if any exceptions are generated
                catch (Exception ex)
                {
                    Debug.WriteLine(ex);
                }
                finally
                {
                    con.Close();
                }
            }
        }
    }
}

Column Type Enumeration

This is very simple, just number and string types. I use this, in part, because of the code completion snippet for the switch statement, which puts in the case labels from the enumeration.

/// <summary>
/// Simple enum to keep the types which this routine will create
/// </summary>
internal enum ColumnType
{
    String, Number
}

Column Information Class

This is just a simple container class for the column details which are read from the database. This could be a struct, but I have arguments with them at times. I should spend some time getting to know how to use the struct type more effectively, but that can wait.

/// <summary>
/// Simple data class (could be a struct but I have arguments with them at times)
/// Just holds the key values for a columns being generated.
/// </summary>
internal class ColumnInfo
{
    private string _columnName;
    private string _columnType;
    private ColumnType _colTypeEnum;
    private int _columnLength;

    public int ColumnLength
    {
        get { return _columnLength; }
        set { _columnLength = value; }
    }

    internal ColumnType ColTypeEnum
    {
        get { return _colTypeEnum; }
        set { _colTypeEnum = value; }
    }

    public string ColumnType
    {
        get { return _columnType; }
        set { _columnType = value; }
    }

    public string ColumnName
    {
        get { return _columnName; }
        set { _columnName = value; }
    }
    public ColumnInfo(string colName, string colType, int colLength)
    {
        this._columnName = colName;
        this._columnType = colType;
        this._columnLength = colLength;
        if (_columnType == "numeric")
            _colTypeEnum = CSVDataGenerator.ColumnType.Number;
        else
            _colTypeEnum = CSVDataGenerator.ColumnType.String;
    }
}

Data Generator Class

This is the class where all of the “heavy lifting” is done. A couple of key points to note:

  • The XML Control File (see the sample below, after the code) defines a single row count for all output files. Extending the XML file to define the number of rows per table is entirely possible; it was just not a feature I needed at this point.
  • You do need the tables created in the database prior to using this. Not a big issue in my mind, but it does slow some uses down.
  • Extending the data types supported by this process would not be a big issue.
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;

namespace CSVDataGenerator
{
    internal class DataGenerator
    {
        /// <summary>
        /// Private string for where to write the output files
        /// </summary>
        private string outputDir;
        /// <summary>
        /// The path and filename for the control file which will be used
        /// </summary>
        private string controlFile;
        /// <summary>
        /// Number of rows to be created, read from the control file
        /// </summary>
        private int RowsRequired;

        /// <summary>
        /// Main constructor for the class.
        /// Just stores the two pieces of file information.
        /// </summary>
        /// <param name="outputDir">The directory to which the output will be written</param>
        /// <param name="controlFile">The location and name of the control file which will be used</param>
        public DataGenerator(string outputDir, string controlFile)
        {
            // TODO: Complete member initialization
            this.outputDir = outputDir;
            this.controlFile = controlFile;
        }
        /// <summary>
        /// Parameterless overload, never called. It just throws an exception.
        /// </summary>
        internal void Generate()
        {
            throw new NotImplementedException();
        }

        /// <summary>
        /// Main loop
        /// Reads the control file
        /// Finds all tables
        /// Finds all columns for each table
        /// Generates the random data
        /// </summary>
        /// <param name="con">SQLConnect used to find all the tables</param>
        internal void Generate(SqlConnection con)
        {
            ReadControlFile();
            using (SqlCommand cmd = new SqlCommand(
                "select table_name from INFORMATION_SCHEMA.TABLES", con))
            {
                using (SqlDataReader rdr = cmd.ExecuteReader())
                {
                    while (rdr.Read())
                    {
                        string tableName = rdr.GetString(0);
                        var ColumnMap = ReadColumns(con, tableName);
                        BuildTestData(ColumnMap, RowsRequired, tableName, outputDir);
                    }
                }
            }
        }
        /// <summary>
        /// Writes the output file, with the number of rows required
        /// </summary>
        /// <param name="ColumnMap">The columns, in order</param>
        /// <param name="RowsRequired">The number of rows to write. An argument so that different length outputs could be produced</param>
        /// <param name="tableName">Table Name from the Database, used to create the output file name</param>
        /// <param name="outputDir">The directory that the files are written to.</param>
        private void BuildTestData(List<ColumnInfo> ColumnMap, int RowsRequired, string tableName, string outputDir)
        {
            string writeFile = outputDir + "\\" + tableName + ".csv";
            // Skip if already exists. Handy if you don't want to make all of the files.
            if (File.Exists(writeFile))
                return;
            // Random number generator. Initialised with the most random thing I can easily find.
            Random random = new Random((int)DateTime.Now.Ticks);
            // Lots of appending, use a StringBuilder for better performance.
            StringBuilder line = new StringBuilder();
            // Using the write, does the clean up for me.
            using (StreamWriter writer = new StreamWriter(writeFile))
            {
                for (int i = 0; i < RowsRequired; i++)
                {
                    // First column written for the row does not need the ',' first
                    bool bFirst = true;
                    foreach (var col in ColumnMap)
                    {
                        if (bFirst)
                            bFirst = false;
                        else
                            line.Append(",");
                        // Make some data type sensitive random data
                        switch (col.ColTypeEnum)
                        {
                            case ColumnType.String:
                                line.Append(BuildRandomString(col.ColumnLength, random));
                                break;
                            case ColumnType.Number:
                                line.Append(random.Next());
                                break;
                            default:
                                break;
                        }
                    }
                    // write the line to the file
                    writer.WriteLine(line.ToString());
                    // Empty the StringBuilder, saves making a new one each line.
                    line.Clear();
                }
            }
        }
        /// <summary>
        /// Builds a string of "normal" ASCII characters, minus '"' and ','
        /// </summary>
        /// <param name="maxLength">The maximum length of the string column</param>
        /// <param name="rnd">The Random object to use</param>
        /// <returns>A random string of printable ASCII characters</returns>
        private string BuildRandomString(int maxLength, Random rnd)
        {
            // Random length string up to max length
            int length = rnd.Next(0, maxLength);
            // StringBuilder, because lots of appends here.
            StringBuilder rndTxt = new StringBuilder();
            for (int i = 0; i < length; i++)
            {
                char nextChar = Convert.ToChar(rnd.Next(32, 126));
                while(nextChar == ',' || nextChar == '"')
                    nextChar = Convert.ToChar(rnd.Next(32, 126));
                rndTxt.Append(nextChar);
            }
            return rndTxt.ToString();
        }
        /// <summary>
        /// Reads the columns from the table, and builds the internal column list.
        /// </summary>
        /// <param name="con">SqlConnection which is used to provide the connection string to be used
        /// tableName">The table we want the columns from</param>
        /// <returns>List of column information objects for the table</returns>
        private List<ColumnInfo> ReadColumns(SqlConnection con, string tableName)
        {
            List<ColumnInfo> results = null;
            // using SqlConnection, lets the framework do the clean up
            using (SqlConnection con1 = new SqlConnection(con.ConnectionString))
            {
                con1.Open();
                // SQL Statement, order by column ordinal, keeps the columns in order
                using (SqlCommand cmd = new SqlCommand(
@"select column_name, data_type, character_maximum_Length, ORDINAL_POSITION " +
"from INFORMATION_SCHEMA.COLUMNS where table_name = @1 ORDER BY ORDINAL_POSITION", con1))
                {
                    // Bind the table name into the statement
                    SqlParameter param1 = new SqlParameter("1", tableName);
                    cmd.Parameters.Add(param1);
                    using (SqlDataReader rdr = cmd.ExecuteReader())
                    {
                        while (rdr.Read())
                        {
                            if (results == null)
                                results = new List<ColumnInfo>();
                            string columnName = rdr.GetString(0);
                            string columnType = rdr.GetString(1);
                            switch (columnType)
                            {
                                case "numeric":
                                    results.Add(new ColumnInfo(columnName, columnType, 0));
                                    break;
                                case "nvarchar":
                                    int columnLength = rdr.GetInt32(2);
                                    results.Add(new ColumnInfo(columnName, columnType, columnLength));
                                    break;
                                default:
                                    break;
                            }
                        }
                    }
                    return results;
                }
            }
        }
        /// <summary>
        /// Reads the control file and pulls out the number of rows to write
        /// </summary>
        private void ReadControlFile()
        {
            XDocument xDoc = XDocument.Load(controlFile);
            string rows = (from a in xDoc.Elements("Control").Elements("FileSize").Attributes("Rows")
                            select a.Value).First();
            RowsRequired = Convert.ToInt32(rows);
        }
    }
}
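
The control file mentioned above is trivial. A minimal sample, consistent with what ReadControlFile expects (a Control root element containing a FileSize element with a Rows attribute; the row count of 1000 is just an example value), looks like:

<Control>
  <FileSize Rows="1000" />
</Control>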

Conclusions

This routine is “dumb as a box of hammers”, and only designed to generate random data to be loaded into SQL Server. But, for a quick set of data to test with, it does the job.

I hope you find this useful. Like much of what I post here, some assembly is required. But copy-and-paste coding is far simpler (and quicker) than writing something from scratch. Please feel free to extend this shell in any way you like.


Some LINQ Performance “Rules of thumb”


Use indexed classes

Use indexed classes, or more precisely use structures which support keyed access (Dictionary Class and Lookup Class). The SQL analogy of these structures being like indexed tables, in comparison to List which behaves like an unindexed SQL table, seems to hold true with LINQ.

There is a tipping point with both SQL and LINQ, where the cost of establishing the indexed storage is less than the performance improvement you gain from using it. With LINQ you will have to find that through experimentation with your own use. Use the Stopwatch Class and measure the impact of one strategy over the other.

The performance impact of using indexed data structures is particularly marked when you are doing joins (equijoins) between sequences. If you are joining sequences in LINQ, think seriously about feeding the sequences into a structure which supports a key, and then joining on the keys.
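
As a sketch of the kind of before and after measurement I mean (the data and sizes here are invented purely for illustration), compare a repeated linear scan of a List against a single ToLookup followed by keyed access:

using System;
using System.Diagnostics;
using System.Linq;

internal static class LookupTimingSketch
{
    internal static void Main()
    {
        var items = Enumerable.Range(0, 100000)
            .Select(i => new { Key = i % 1000, Value = i }).ToList();
        var keys = Enumerable.Range(0, 1000).ToList();

        var sw = Stopwatch.StartNew();
        // Unindexed: a full scan of the list for every key (like an unindexed table).
        var slow = keys.Select(k => items.Where(i => i.Key == k).Count()).ToList();
        Console.WriteLine("List scan: {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        // Indexed: pay once to build the lookup, then each keyed access is cheap.
        var byKey = items.ToLookup(i => i.Key);
        var fast = keys.Select(k => byKey[k].Count()).ToList();
        Console.WriteLine("Lookup:    {0} ms", sw.ElapsedMilliseconds);
    }
}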

Use the Join Syntax When Performing an Equijoin

Use the join syntax in LINQ. By inspection/divination, this syntax

var seed = new List<XElement>();
var seq1 = seed.Select(A => new { key = A.Name, value = A })
    .ToLookup(A => A.key, A => A.value);
var seq2 = seed.Select(A => new { key = A.Name, value = A })
    .ToLookup(A => A.key, A => A.value);
var example1 = from a in seq1
               join b in seq2 on a.Key equals b.Key
               select a;

is “better” (can be significantly faster) than syntax

var seed = new List<XElement>();
var seq1 = seed.Select(A => new { key = A.Name, value = A })
    .ToLookup(A => A.key, A => A.value);
var seq2 = seed.Select(A => new { key = A.Name, value = A })
    .ToLookup(A => A.key, A => A.value);
var example2 = from a in seq1
               from b in seq2
               where a.Key == b.Key
               select a;

I am not sure what is happening under the hood, but the elapsed execution times are far better when using the join syntax. Part of the improvement probably comes from the fact that in the second version the string “==” operator is firing multiple times (that’s what the Visual Studio Profiler said).

Beware of Implicit Constructors Firing Multiple Times

This one is another which the Visual Studio Profiler highlighted. The following syntax will fire the XName constructor multiple times, because the implicit string-to-XName conversion runs for every comparison.

var seed = new List<XElement>();
// NB, both the == comparison and the Attribute() call convert the string to an XName
var example3 = from a in seed
               where a.Name == "Fred" && a.Attribute("Fred").Value == "Value"
               select a;

The following is a very simple optimisation, but one which saves “a bucket load” of CPU cycles (multiple calls to the XName constructor).

var seed = new List<XElement>();
// Form the XName once, and use it multiple times
XName Fred = "Fred";
var example3 = from a in seed
               where a.Name == Fred && a.Attribute(Fred).Value == "Value"
               select a;

Beware of || in Where Clauses

This one comes straight from my experience with SQL (both Oracle and SQL Server), and translates into what I’ve observed in the execution of LINQ statements. The or operator (||) in LINQ where clauses can cause very slow execution.

For simple cases you can get a performance boost just by rewriting the LINQ statement as a Union between the two sides of the or. I’ll leave it to you to verify that the results you get out are the same. Also, this is another optimisation for which you should use the Stopwatch Class to test the performance of the before and after cases. For example:

var seed = new List<XElement>();
XName Fred = "Fred";
XName Joe = "Joe";
var example4 = from a in seed
               where a.Attribute(Fred).Value == "test1" || a.Attribute(Joe).Value == "test2"
               select a;

could be rewritten as:

var example5 = (from a in seed
                where a.Attribute(Fred).Value == "test1"
                select a)
                    .Union(from a in seed
                           where a.Attribute(Joe).Value == "test2"
                           select a);

Conclusion

That’s 4 lessons I’ve learned in the last couple of days performance tuning my DGML generating and manipulating program. The Stopwatch Class and Visual Studio Profiler have been invaluable in identifying the performance bottlenecks, or hot spots, in the system. The judicious redesign of some of the sequences to use the Dictionary and Lookup classes, and joining the sequences on the keys of those collections, yielded big improvements in performance.

LINQ seems to be “balanced on a knife edge” at times. Minor redesigns can result in the execution time going from minutes to seconds.


LINQ to SQL: Use static CompiledQuery.Compile and static DataContext for BIG Performance Improvements


Introduction

This blog post describes some of the steps I took to boost, very significantly, the performance of a LINQ to SQL application.

This “train of thought” was prompted by a blog post here on WordPress, “LINQ To SQL Very Slow Performance Without Compile (CompileQuery)”, which “got me thinking” about the data manipulation program I’ve been developing. That program transforms unstructured text (following a sort of pattern) into structured data, which I’m rendering using DGML (see “Introduction to Directed Graph Markup Language ( DGML )” for a brief introduction). I will probably post some of the details of generating DGML graphs on this blog later. The program uses LINQ to SQL to do a bunch of lookups at various points. It had become a bit of a slow process, so it was a perfect candidate for a “bit of a tune up” performance wise, and for the application of some CompiledQuery.Compile enhancements.

The Implementation

There were a couple of steps in the implementation:

  • Move the declaration and initialisation of the LINQ to SQL DataContext object (which is derived from the System.Data.Linq.DataContext class) from the method to a private static field on the class.
    private class DataElementsProcessor
    {
        private static DataElementsDataContext DEctx = new DataElementsDataContext();
  • Next create the declaration of the compiled query. For those reading this who are not familiar with parameterised classes and lambda expressions, I’ll try and explain what is happening in the declaration following the code snippets.
    • What is being replaced (the original LINQ to SQL query statement):
      var found = from de in DEctx.Data_Elements
                  where de.DE_Name == name
                  select de;
    • The replacement declaration:
    • private static Func<DataElementsDataContext, string, IEnumerable<Data_Element>>
          GetDeByName = CompiledQuery.Compile((DataElementsDataContext DEctx, string name) =>
                          from de in DEctx.Data_Elements
                          where de.DE_Name == name
                          select de);
    • To explain the declaration:
      • private static: private because only this class should use this, static because I only want one of these available to all instances of the class.
      • Func<DataElementsDataContext, string, IEnumerable<Data_Element>>: This declares the type of the compiled LINQ to SQL expression. It encapsulates a delegate or lambda expression which takes the first two types as parameters and returns a sequence of Data_Element objects.
      • GetDeByName: is the name of the delegate being created.
      • CompiledQuery.Compile: the method which compiles the LINQ to SQL query (which takes the types identified in the Func< part of the declaration) and returns the compiled delegate.
      • (DataElementsDataContext DEctx, string name) => : Identifies the types and names of the arguments to the lambda expression.
      • from .. : the LINQ to SQL query which will be compiled.

Calling the Compiled Version

This is where it gets “tricky” (I’ll cover the “tricky” bits after the simple cases). The simple use cases are straight replacements:

  • Replace the query.
  • //var found = from de in DEctx.Data_Elements
    //            where de.DE_Name == name
    //            select de;
    var found = GetDeByName(DEctx, name);
  • Replace a bit more:
    • This, I think, “works” a bit better. There is now no need for the variable “found” in this segment of code, and just testing Any() on the sequence is all I wanted to do anyway.
  • //var found = GetDeByName(DEctx, name);
    // if(!found.Any())
    if (!GetDeByName(DEctx, name).Any())
    {
    
    }
  • The “tricky” bits:
    • The sequence you get at runtime is an “iterate once only” sequence. So, the following code will throw a “System.InvalidOperationException” in System.Data.Linq.dll, with the message “<Message>The query results cannot be enumerated more than once.</Message>”. In my code, the situations which caused this type of exception looked like the following:
    • if (!found.Any())
      {
      }
      else
      {
          Debug.WriteLine("Rejected update already existed");
          // needed to add this when I changed to a compiled query
          //found = GetDEByName(DECtx, DE);
          found.ToList().ForEach(A => Debug.WriteLine(String.Format("{0} {1}", A.DE_Name, A.DE_Classification)));
      }
    • This would break with the exception. The “remedy” I’ve applied has been to use a “ToList()“ to pull the results into a collection in one iteration, and then have the flexibility to iterate over the collection as much as I like. This approach costs a bit of memory, but these have been single row result sets, so there is not too much of a penalty.
      • There are probably other, more elegant, ways to “solve” this problem. But, as part of a quick upgrade to get a BIG performance improvement, they would have been more expensive.
    • // Need to pull the result set into a local list
      var found = GetDEByName(DECtx, DE).ToList();
      if (!found.Any())
      {
      }
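
Putting the pieces together, here is a consolidated sketch of the whole pattern. The class, context, and property names are the ones used in the snippets above (they come from my generated LINQ to SQL model, so treat this as illustrative rather than drop-in code):

using System;
using System.Collections.Generic;
using System.Data.Linq;
using System.Diagnostics;
using System.Linq;

internal class DataElementsProcessor
{
    // One static context, shared by all instances of the class.
    private static DataElementsDataContext DEctx = new DataElementsDataContext();

    // Compiled once, reused on every call.
    private static Func<DataElementsDataContext, string, IEnumerable<Data_Element>>
        GetDeByName = CompiledQuery.Compile((DataElementsDataContext ctx, string name) =>
                        from de in ctx.Data_Elements
                        where de.DE_Name == name
                        select de);

    internal void Process(string name)
    {
        // ToList() pulls the "enumerate once only" results into a collection.
        var found = GetDeByName(DEctx, name).ToList();
        if (!found.Any())
        {
            // ... insert the new element here ...
        }
        else
        {
            Debug.WriteLine("Rejected update, already existed");
            found.ForEach(a => Debug.WriteLine(a.DE_Name));
        }
    }
}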

The Results

For the program on which I was working, the following table shows the before and after results.

                                               Run 1    Run 2    Run 3    Run 4
Base Line, No Compiles                        114451   115143
All Queries Compiled and Static DataContexts   56906    58120    57177    56515
Percentage Improvement (start – end)/start    50.19077

This is a significant improvement in performance for what amounted to “not too much work”. The hard part was getting “my head around” the Func<> template type declaration. The numbers are from the Stopwatch Class and are ElapsedMilliseconds. The Stopwatch object is the best way I know of to get elapsed times for executing code in .NET.

Conclusion:

  • A vote of thanks to the author of “Er. alokpandey’s Blog” for writing the article “LINQ To SQL Very Slow Performance Without Compile (CompileQuery)”, without which I would not have been prompted to implement this way of improving the performance of the program (and many more in the future, I expect).
  • The conversion is not painless, if you do not understand what the Func<> is doing. Once you see what that piece of syntax is doing, then things start to fall into place.
  • I did not expect such a big gain in performance from the implementation of this change as I achieved. But then, I had not probed the code to discover where the time was being expended.
  • Apologies if you do not like the code snippet approach this post has taken. Most of the time I try and post code segments which can be used without too much modification. This time the topic is really about statements (declarations of things in the code and the syntax), rather than being about “a way of doing something” (which a method can illustrate). So, I’ve used snippets of code to demonstrate the syntax used.
  • Again I yearn for the flexibility which Windows Live Spaces gave, where I could put a file of code up as well as a blog post. That is something which WordPress does not offer, so code snippets, as above, are the solution which I’ve adopted.
