Posts Tagged Tools

Random CSV Test Data Generator


Introduction

I have been building a piece of software which loads CSV data in parallel, using the SqlBulkCopy class, into SQL Server tables. If you have ever loaded data into SQL Server, you will know that SqlBulkCopy is the fastest way to load it. This little project will be the subject of some subsequent blog posts.

To test the parallel loading software I needed some “junk”, or random, data to load into the tables. I just needed a “bunch” of rows to load from a file, which would go into the table “clean”. I did not want to have data errors, at least in the initial testing of the program. I also wanted enough rows so that I could see that the parallel loading was really happening.

After a quick browse of the internet, nothing jumped out as a quick way to generate data. So, I wrote this piece of software, which generates some dummy data for me. I now present that software here for anyone to take and use (modify it to your heart’s content).

Why Load in Parallel?

If you have ever loaded bulk data into any relational database, you will know that it is a slow process. The slowness has a number of causes, but one way to combat it is to load multiple tables at the same time. Data files also come in different sizes, so different tables take different amounts of time to load, which makes something that loads multiple tables at once very handy. Having a routine which loads numerous tables, on different threads, at the same time means that you can significantly cut the elapsed time of the data loads compared to a serial process (believe me, it does; I have done it before).
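To make the idea concrete, here is a minimal sketch of that approach, not the actual loader. The LoadCsvAsDataTable helper and the list of table names are assumptions for illustration only.

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Threading.Tasks;

internal static class ParallelLoadSketch
{
    // Load each table on its own thread. One connection per thread,
    // because SqlConnection instances are not thread safe.
    internal static void LoadAll(string connectString, IEnumerable<string> tableNames)
    {
        Parallel.ForEach(tableNames, tableName =>
        {
            using (var con = new SqlConnection(connectString))
            {
                con.Open();
                using (var bulk = new SqlBulkCopy(con))
                {
                    bulk.DestinationTableName = tableName;
                    // Hypothetical helper: parse <tableName>.csv into a DataTable.
                    DataTable data = LoadCsvAsDataTable(tableName);
                    bulk.WriteToServer(data);
                }
            }
        });
    }

    // Placeholder only; parsing the CSV file is outside this sketch.
    private static DataTable LoadCsvAsDataTable(string tableName)
    {
        throw new System.NotImplementedException();
    }
}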

Junk or Random Data

The “junk” data has some very special properties, which include:

  • Only numbers (integers) or strings. I did not want the problems of loading other data types (just yet).
  • The strings should be within the maximum length of the database field. Again, I don’t want data errors (just yet).
  • The strings will be simple. There will be no enclosing ‘”’ characters, and embedded ‘,’ characters are out of scope as well. Again, I just wanted the simplest string parsing possible.
  • I did not care if the data made sense. This program makes really random junk text and numbers.
  • There is no referential integrity between the tables being loaded. That is, no key in one table needs to exist in another table. Again, this removes the possibility of loads failing because of violated integrity constraints.
  • There are minimal integrity constraints on the tables. This is twofold:
    1. There are no keys or indexes on the tables.
    2. I just wanted data which would test the load process, not be slowed down by the database doing housekeeping on the way through.
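As a purely illustrative example, for a table with an integer column and an nvarchar(10) column, the generated file would contain lines of the following shape (these values are made up, not output from the program):

1803561974,x7#Kp q]2
209488312,Tz!e
1467039858,(pQ 9rL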

The Main Program

This is a pretty simple main. There are a couple of points to note:

  • The creation of the SqlConnection is a bit of a “hangover” from years of writing code which used Oracle. In the Oracle environment connections cost a lot to create (relative to SQL Server), and they are more flexible (you can have multiple fetches happening on the one connection; in SQL Server the SqlCommand is better isolated).
  • It just passes the work off to the DataGenerator to do all of the heavy lifting.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data.SqlClient;
using System.Diagnostics;

namespace CSVDataGenerator
{
    /// <summary>
    /// Program class, which does some of the work
    /// </summary>
    class Program
    {
        /// <summary>
        /// Where the output files are written
        /// </summary>
        private const string outputDir =
@"C:\Users\Craig\Documents\Visual Studio 10\Projects\ParallelDataLoad\LoadData\Small";
        /// <summary>
        /// XML Control file which currently sets the number of output rows in the
        /// generated output files
        /// </summary>
        private const string controlFile =
@"C:\Users\Craig\Documents\Visual Studio 10\Projects\ParallelDataLoad\LoadData\Small\ControlFile.xml";
        /// <summary>
        /// Connect string to the db which contains the tables.
        /// </summary>
        private const string connectString =
"Data Source=CRAIG-PC-1;Initial Catalog=LoadContainer;Integrated Security=True;Pooling=False";
        /// <summary>
        /// Main method.
        /// </summary>
        /// <param name="args">Unused, at present</param>
        static void Main(string[] args)
        {
            var dataGenerator = new DataGenerator(outputDir, controlFile);
            using (SqlConnection con = new SqlConnection(connectString))
            {
                try
                {
                    con.Open();
                    dataGenerator.Generate(con);
                }
                // Naughty, but it is a programmer tool.
                // I just want to know if any exceptions are generated
                catch (Exception ex)
                {
                    Debug.WriteLine(ex);
                }
                finally
                {
                    con.Close();
                }
            }
        }
    }
}

Column Type Enumeration

This is very simple: just number and string types. I use an enum, in part, because the code completion snippet for the switch statement puts in the case labels from the enumeration.

/// <summary>
/// Simple enum to keep the types which this routine will create
/// </summary>
internal enum ColumnType
{
    String, Number
}

Column Information Class

This is just a simple container class for the column details which are read from the database. It could be a struct, but I have arguments with structs at times. I should spend some time getting to know how to use the struct type more effectively, but that can wait.

/// <summary>
/// Simple data class (could be a struct but I have arguments with them at times)
/// Just holds the key values for a column being generated.
/// </summary>
internal class ColumnInfo
{
    private string _columnName;
    private string _columnType;
    private ColumnType _colTypeEnum;
    private int _columnLength;

    public int ColumnLength
    {
        get { return _columnLength; }
        set { _columnLength = value; }
    }

    internal ColumnType ColTypeEnum
    {
        get { return _colTypeEnum; }
        set { _colTypeEnum = value; }
    }

    public string ColumnType
    {
        get { return _columnType; }
        set { _columnType = value; }
    }

    public string ColumnName
    {
        get { return _columnName; }
        set { _columnName = value; }
    }
    public ColumnInfo(string colName, string colType, int colLength)
    {
        this._columnName = colName;
        this._columnType = colType;
        this._columnLength = colLength;
        if (_columnType == "numeric")
            _colTypeEnum = CSVDataGenerator.ColumnType.Number;
        else
            _colTypeEnum = CSVDataGenerator.ColumnType.String;
    }
}

Data Generator Class

This is the class where all of the “heavy lifting” is done. A couple of key points to note:

  • The XML Control File (see the sample below) is set up to define a single row count for all output files. Extending the XML file to define the number of rows per table is entirely possible; it just was not something I needed at this point.
  • You do need the tables created in the database prior to using this. Not a big issue in my mind, but it does slow some uses down.
  • Extending the data types supported by this process would not be a big issue (a sketch of what that could look like follows the class below).
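For reference, a minimal control file consistent with the ReadControlFile method below would look something like this (the Rows value of 1000 is just an example):

<Control>
  <FileSize Rows="1000" />
</Control>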
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;

    internal class DataGenerator
    {
        /// <summary>
        /// Private string for where to write the output files
        /// </summary>
        private string outputDir;
        /// <summary>
        /// The path and filename form the control file which will be used
        /// </summary>
        private string controlFile;
        /// <summary>
        /// Number of row to be created, read from the control file
        /// </summary>
        private int RowsRequired;

        /// <summary>
        /// Main constructor for the class.
        /// Just stores the two pieces of file information.
        /// </summary>
        /// <param name="outputDir">The directory which the output will be written to</param>
        /// <param name="controlFile">The location and name of the control file which will be used</param>
        public DataGenerator(string outputDir, string controlFile)
        {
            // TODO: Complete member initialization
            this.outputDir = outputDir;
            this.controlFile = controlFile;
        }
        /// <summary>
        /// Parameterless Generate overload, never called. It just throws an
        /// exception, as generation makes no sense without a connection.
        /// </summary>
        internal void Generate()
        {
            throw new NotImplementedException();
        }

        /// <summary>
        /// Main loop
        /// Reads the control file
        /// Finds all tables
        /// Finds all columns for each table
        /// Generates the random data
        /// </summary>
        /// <param name="con">SQLConnect used to find all the tables</param>
        internal void Generate(SqlConnection con)
        {
            ReadControlFile();
            using (SqlCommand cmd = new SqlCommand(
                "select table_name from INFORMATION_SCHEMA.TABLES", con))
            {
                using (SqlDataReader rdr = cmd.ExecuteReader())
                {
                    while (rdr.Read())
                    {
                        string tableName = rdr.GetString(0);
                        var ColumnMap = ReadColumns(con, tableName);
                        BuildTestData(ColumnMap, RowsRequired, tableName, outputDir);
                    }
                }
            }
        }
        /// <summary>
        /// Writes the output file, with the number of rows required.
        /// </summary>
        /// <param name="ColumnMap">The columns, in order</param>
        /// <param name="RowsRequired">The number of rows to write. An argument so that different length outputs could be produced</param>
        /// <param name="tableName">Table name from the database, used to create the output file name</param>
        /// <param name="outputDir">The directory that the files are written to.</param>
        private void BuildTestData(List<ColumnInfo> ColumnMap, int RowsRequired, string tableName, string outputDir)
        {
            string writeFile = outputDir + "\\" + tableName + ".csv";
            // Skip if already exists. Handy if you don't want to make all of the files.
            if (File.Exists(writeFile))
                return;
            // Random number generator. Initialised with the most random thing I can easily find.
            Random random = new Random((int)DateTime.Now.Ticks);
            // Lots of appending, use a StringBuilder for better performance.
            StringBuilder line = new StringBuilder();
            // Using the write, does the clean up for me.
            using (StreamWriter writer = new StreamWriter(writeFile))
            {
                for (int i = 0; i < RowsRequired; i++)
                {
                    // First column written for the row does not need the ',' first
                    bool bFirst = true;
                    foreach (var col in ColumnMap)
                    {
                        if (bFirst)
                            bFirst = false;
                        else
                            line.Append(",");
                        // Make some data type sensitive random data
                        switch (col.ColTypeEnum)
                        {
                            case ColumnType.String:
                                line.Append(BuildRandomString(col.ColumnLength, random));
                                break;
                            case ColumnType.Number:
                                line.Append(random.Next());
                                break;
                            default:
                                break;
                        }
                    }
                    // write the line to the file
                    writer.WriteLine(line.ToString());
                    // Empty the StringBuilder, saves making a new one each line.
                    line.Clear();
                }
            }
        }
        /// <summary>
        /// Builds a string of "normal" ASCII characters, minus '"' and ','
        /// </summary>
        /// <param name="maxLength">The maximum length of the string column</param>
        /// <param name="rnd">The Random object to use</param>
        /// <returns>A random string within the column's maximum length</returns>
        private string BuildRandomString(int maxLength, Random rnd)
        {
            // Random length string; Next's upper bound is exclusive, so 0 to maxLength - 1 characters
            int length = rnd.Next(0, maxLength);
            // StringBuilder, because lots of appends here.
            StringBuilder rndTxt = new StringBuilder();
            for (int i = 0; i < length; i++)
            {
                char nextChar = Convert.ToChar(rnd.Next(32, 126));
                while(nextChar == ',' || nextChar == '"')
                    nextChar = Convert.ToChar(rnd.Next(32, 126));
                rndTxt.Append(nextChar);
            }
            return rndTxt.ToString();
        }
        /// <summary>
        /// Reads the columns from the table, and builds the internal column list.
        /// </summary>
        /// <param name="con">SqlConnection which is used to provide the connection string to be used
        /// tableName">The table we want the columns from</param>
        /// <returns>List of column information objects for the table</returns>
        private List<ColumnInfo> ReadColumns(SqlConnection con, string tableName)
        {
            List<ColumnInfo> results = null;
            // using SqlConnection, lets the framework do the clean up
            using (SqlConnection con1 = new SqlConnection(con.ConnectionString))
            {
                con1.Open();
                // SQL Statement, order by column ordinal, keeps the columns in order
                using (SqlCommand cmd = new SqlCommand(
@"select column_name, data_type, character_maximum_Length, ORDINAL_POSITION " +
"from INFORMATION_SCHEMA.COLUMNS where table_name = @1 ORDER BY ORDINAL_POSITION", con1))
                {
                    // Bind the table name into the statement
                    SqlParameter param1 = new SqlParameter("1", tableName);
                    cmd.Parameters.Add(param1);
                    using (SqlDataReader rdr = cmd.ExecuteReader())
                    {
                        while (rdr.Read())
                        {
                            if (results == null)
                                results = new List<ColumnInfo>();
                            string columnName = rdr.GetString(0);
                            string columnType = rdr.GetString(1);
                            switch (columnType)
                            {
                                case "numeric":
                                    results.Add(new ColumnInfo(columnName, columnType, 0));
                                    break;
                                case "nvarchar":
                                    int columnLength = rdr.GetInt32(2);
                                    results.Add(new ColumnInfo(columnName, columnType, columnLength));
                                    break;
                                default:
                                    break;
                            }
                        }
                    }
                    return results;
                }
            }
        }
        /// <summary>
        /// Reads the control file and pulls out the number of rows to write
        /// </summary>
        private void ReadControlFile()
        {
            XDocument xDoc = XDocument.Load(controlFile);
            string rows = (from a in xDoc.Elements("Control").Elements("FileSize").Attributes("Rows")
                            select a.Value).First();
            RowsRequired = Convert.ToInt32(rows);
        }
    }
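As noted in the bullet list above the class, extending the supported data types would not be hard: add a member to the ColumnType enum, map the new SQL type name in ReadColumns (and in the ColumnInfo constructor), and add a case to the switch in BuildTestData. As a hedged sketch, a hypothetical ColumnType.Date case could append the output of a helper like this one, which does not exist in the current code:

/// <summary>
/// Illustrative only: builds a random ISO-format date string, the kind of
/// value a ColumnType.Date case in BuildTestData could append.
/// </summary>
private string BuildRandomDate(Random rnd)
{
    // Any day in 2010, formatted so SQL Server will parse it unambiguously.
    DateTime day = new DateTime(2010, 1, 1).AddDays(rnd.Next(0, 365));
    return day.ToString("yyyy-MM-dd");
}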

Conclusions

This routine is “dumb as a box of hammers”, and is only designed to generate random data to be loaded into SQL Server. But for a quick set of data to test with, it does the job.

I hope you find this useful. Like much of what I post here, some assembly is required. But copy and paste coding is far simpler (and quicker) than writing something from scratch. Please feel free to extend this shell in any way you like.



2010 in review


The stats helper monkeys at WordPress.com mulled over how this blog did in 2010, and here’s a high level summary of its overall blog health:

Healthy blog!

The Blog-Health-o-Meter™ reads Fresher than ever.

Crunchy numbers


A Boeing 747-400 passenger jet can hold 416 passengers. This blog was viewed about 2,400 times in 2010. That’s about 6 full 747s.


In 2010, there were 51 new posts, growing the total archive of this blog to 101 posts. There were 88 pictures uploaded, taking up a total of 6MB. That’s about 2 pictures per week.

The busiest day of the year was December 14th with 78 views. The most popular post that day was LINQ Performance Tuning: Using the LookUp Class.

Where did they come from?

The top referring sites in 2010 were ifreestores.com, social.msdn.microsoft.com, en.wordpress.com, facebook.com, and icreditcard.biz.

Some visitors came searching, mostly for microstoven, eband patches, igrouping example, expandoobject linq, and microstoven recipes.

Attractions in 2010

These are the posts and pages that got the most views in 2010.

1. LINQ Performance Tuning: Using the LookUp Class (November 2010)
2. Convert an XPS to JPEG, PNG, TIFF or BMP in A4 Pages (November 2010) – 3 comments
3. ATI Mobility Driver Upgrade – Installer Blue Screens (October 2010)
4. Build Pivot Tables from Analysis Services Cube in C# (June 2010)
5. LINQ SelectMany and IGrouping (October 2010) – 1 comment



Debug.WriteLine in C# 4.0 and .Net Framework 4.0


Introduction

This is a very short blog post. There is one improvement which has made its way into C# 4.0 and the .Net Framework 4.0 which I wish to share.

This improvement is a very simple one. If I had a paranoid streak, I would say Microsoft must have been watching the code I’ve been writing. Why? Because the code I write is littered with the following:

int i = 0;
// The old way
Debug.WriteLine(String.Format("What is i {0} - old way", i));

I use the debug output window in Visual Studio heavily. I find shoving formatted strings into the debug output window is one of my standard ways of producing debug diagnostics from programs in development.

The Enhancement

[ConditionalAttribute("DEBUG")]
public static void WriteLine(
    string format,
    params Object[] args
);

The above is the prototype of the addition to the Debug class, another overload of the WriteLine method. This WriteLine method is documented here: Debug.WriteLine.

Simply put, this is a WriteLine method which encapsulates the String.Format call, which results in code like:

int i = 0;
// The old way
Debug.WriteLine(String.Format("What is i {0} - old way", i));
// New in Version 4 of C# and .Net Framework
Debug.WriteLine("What is i {0} - New C# 4.0 way", i);

Conclusion

It’s a great improvement. It will save me a lot of keystrokes when I’m developing. But this enhancement has a downside: now I need to remember to use it!


