Random CSV Test Data Generator


Introduction

I have been building a piece of software which parallel loads CSV data, using the SqlBulkCopy class, into SQL Server tables. If you have ever loaded data into SQL Server, then you would know that SqlBulkCopy is the fastest way to load data. This little project will be the subject of some subsequent blog posts.

To test the parallel loading software I needed some “junk”, or random, data to load into the tables. I just needs a “bunch” of rows to load from a file, which would go into the table “clean”. I did not want to have data errors, at least in the initial testing of the program. I also wanted enough rows so that I can see that the parallel loading is really happening.

After a quick browse on the internet, nothing jumped up as being a quick way to generate data. So, I wrote this piece of software which generates some dummy data form me. I now present that software here for anyone to take and use (modify to your hearts content).

Why load in Parallel?

If you have ever loaded bulk data into any relational database, you will know that it is a slow process. This slowness comes from a number of causes, but one way to combat the slowness is to load multiple tables at the same time. Also, data files often come in different sizes, resulting different tables taking different amounts of time to load, so having something which will load multiple tables at once is very handy. Having a routine which will load numerous tables, on different threads, at the same time means that you can significantly cut the elapse time for the data loads, when compared to a serial process (believe me it does, I’ve done it before today).

Junk or Random Data

The “junk” data has some very special properties, which include:

  • Only numbers (integers), or strings. I did not want the problems of loading other data types (just yet).
  • The strings should be within the maximum length of the database field. Again, I don’t want data errors (just yet).
  • The strings will be simple. There will be no enclosing ‘”’ characters, and the imbedded ‘,’ is out of scope as well. Again, I just wanted the simplest simple string parsing as possible.
  • I did not care if the data made sense. This program makes really random junk text and numbers.
  • There is no referential integrity between the tables being loaded. That means a key in one needed to go into another table. Again, this removes the possibility of loads failing because of violated integrity constraints,
  • There are minimal integrity constraints on the table. This is a two fold constraint:
    1. There are no keys or indexes on the tables.
    2. I just wanted data which would test the load process, not be slowed down by the database doing housekeeping on the way through.

The Main Program

This is a pretty simple main. There are a couple of points to note:

  • The creating of the SqlConnection is a bit a “hangover” from years of writing code which used Oracle. In the Oracle environment connections cost a lot (a relative comparison to SQL Server) to create, and are more flexible (you can have multiple fetches happening on the one connection – the SqlCommand is better isolated).
  • It just passes off to the Generator to do all of the heavy lifting.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data.SqlClient;
using System.Diagnostics;

namespace CSVDataGenerator
{
    /// <summary>
    /// Program class, which does some of the work
    /// </summary>
    class Program
    {
        /// <summary>
        /// Where the output files are written
        /// </summary>
        private const string outputDir =
@"C:\Users\Craig\Documents\Visual Studio 10\Projects\ParallelDataLoad\LoadData\Small";
        /// <summary>
        /// XML Control file which currently sets the number of output rows in the
        /// generated output files
        /// </summary>
        private const string controlFile =
@"C:\Users\Craig\Documents\Visual Studio 10\Projects\ParallelDataLoad\LoadData\Small\ControlFile.xml";
        /// <summary>
        /// Connect string to the db which contains the tables.
        /// </summary>
        private const string connectString =
"Data Source=CRAIG-PC-1;Initial Catalog=LoadContainer;Integrated Security=True;Pooling=False";
        /// <summary>
        /// Main method.
        /// 
        /// args">Unused, at present</param>
        static void Main(string[] args)
        {
            var dataGenerator = new DataGenerator(outputDir, controlFile);
            using (SqlConnection con = new SqlConnection(connectString))
            {
                try
                {
                    con.Open();
                    dataGenerator.Generate(con);
                }
                // Naughty, but it is a programmer tool.
                // I just want to know if any exceptions are generated
                catch (Exception ex)
                {
                    Debug.WriteLine(ex);
                }
                finally
                {
                    con.Close();
                }
            }
        }
    }
}

Column Type Enumeration

This is very simple, just number and string types. I use this, in part, because of the code completion snippet for the switch statement, which puts in the case labels from the enumeration.

/// <summary>
/// Simple enum to keep the types which this routine will create
/// </summary>
internal enum ColumnType
{
    String, Number
}

Column Information Class

This is just a simple container class for the column details which are read from the database. This could be a struct , but I have arguments when at times. I should spend some time getting to know how to use the struct type more effectively, but that can wait.

/// <summary>
/// Simple data class (could be a struct but I have arguments with them at times)
/// Just holds the key values for a columns being generated.
/// </summary>
internal class ColumnInfo
{
    private string _coumnName;
    private string _columnType;
    private ColumnType _colTypeEnum;
    private int _columnLength;

    public int ColumnLength
    {
        get { return _columnLength; }
        set { _columnLength = value; }
    }

    internal ColumnType ColTypeEnum
    {
        get { return _colTypeEnum; }
        set { _colTypeEnum = value; }
    }

    public string ColumnType
    {
        get { return _columnType; }
        set { _columnType = value; }
    }

    public string CoumnName
    {
        get { return _coumnName; }
        set { _coumnName = value; }
    }
    public ColumnInfo(string colName, string colType, int colLength)
    {
        this._coumnName = colName;
        this._columnType = colType;
        this._columnLength = colLength;
        if (_columnType == "numeric")
            _colTypeEnum = CSVDataGenerator.ColumnType.Number;
        else
            _colTypeEnum = CSVDataGenerator.ColumnType.String;
    }
}

Data Generator Class

This is the class where all of the “heavy lifting” is done. A couple of key point to note:

  • The XML Control File (see sample below) is set up to define the one size for all output files. Extending this xml file to define the number of rows generated by the table is entirely possible. This feature was not something which I need at this point.
  • You do need the tables created in the database prior to using this. Not a big issue in my mind. But, does slow some uses down.
  • Extending the data types supported by this process, would not be a big issue.
    internal class DataGenerator
    {
        /// <summary>
        /// Private string for where to write the output files
        /// </summary>
        private string outputDir;
        /// <summary>
        /// The path and filename form the control file which will be used
        /// </summary>
        private string controlFile;
        /// <summary>
        /// Number of row to be created, read from the control file
        /// </summary>
        private int RowsRequired;

        /// <summary>
        /// Main constructor for the class.
        /// Just stores the two pieces of file information
        /// 
        /// outputDir">The director which the output will be written to
        /// controlFile">The location and name of the control file which will be used</param>
        public DataGenerator(string outputDir, string controlFile)
        {
            // TODO: Complete member initialization
            this.outputDir = outputDir;
            this.controlFile = controlFile;
        }
        /// <summary>
        /// Default constructor, not call. Probably should throw an exception.
        /// </summary>
        internal void Generate()
        {
            throw new NotImplementedException();
        }

        /// <summary>
        /// Main loop
        /// Reads the control file
        /// Finds all tables
        /// Finds all columns for each table
        /// Generates the random data
        /// </summary>
        /// <param name="con">SQLConnect used to find all the tables</param>
        internal void Generate(SqlConnection con)
        {
            ReadControlFile();
            using (SqlCommand cmd = new SqlCommand(
                "select table_name from INFORMATION_SCHEMA.TABLES", con))
            {
                using (SqlDataReader rdr = cmd.ExecuteReader())
                {
                    while (rdr.Read())
                    {
                        string tableName = rdr.GetString(0);
                        var ColumnMap = ReadColumns(con, tableName);
                        BuildTestData(ColumnMap, RowsRequired, tableName, outputDir);
                    }
                }
            }
        }
        /// <summary>
        /// Writes the output file, with the number of rows required
        /// 
        /// ColumnMap">The columns, in order
        /// RowsRequired">The number of rows to write. An argument so that different length outputs could be produced
        /// tableName">Table Name from the Database, used to create the output file name</param>
        /// <param name="outputDir">The directory that the files are written to.</param>
        private void BuildTestData(List<ColumnInfo> ColumnMap, int RowsRequired, string tableName, string outputDir)
        {
            string writeFile = outputDir + "\\" + tableName + ".csv";
            // Skip if already exists. Handy if you don't want to make all of the files.
            if (File.Exists(writeFile))
                return;
            // Random number generator. Initialised with the most random thing I can easily find.
            Random random = new Random((int)DateTime.Now.Ticks);
            // Lots of appending, use a StringBuilder for better performance.
            StringBuilder line = new StringBuilder();
            // Using the write, does the clean up for me.
            using (StreamWriter writer = new StreamWriter(writeFile))
            {
                for (int i = 0; i < RowsRequired; i++)
                {
                    // First column written for the row does not need the ',' first
                    bool bFirst = true;
                    foreach (var col in ColumnMap)
                    {
                        if (bFirst)
                            bFirst = false;
                        else
                            line.Append(",");
                        // Make some data type sensitive random data
                        switch (col.ColTypeEnum)
                        {
                            case ColumnType.String:
                                line.Append(BuildRandomString(col.ColumnLength, random));
                                break;
                            case ColumnType.Number:
                                line.Append(random.Next());
                                break;
                            default:
                                break;
                        }
                    }
                    // write the line to the file
                    writer.WriteLine(line.ToString());
                    // Empty the StringBuilder, saves making a new one each line.
                    line.Clear();
                }
            }
        }
        /// <summary>
        /// Builds a string of "normal" ASCII characters, minus '"' and ','
        /// 
        /// maxLength">The maximum length of the string column
        /// rnd">The Random object to use</param>
        /// <returns></returns>
        private string BuildRandomString(int maxLength, Random rnd)
        {
            // Random length string up to max lenght
            int length = rnd.Next(0, maxLength);
            // StringBuilder, because lots of appends here.
            StringBuilder rndTxt = new StringBuilder();
            for (int i = 0; i < length; i++)
            {
                char nextChar = Convert.ToChar(rnd.Next(32, 126));
                while(nextChar == ',' || nextChar == '"')
                    nextChar = Convert.ToChar(rnd.Next(32, 126));
                rndTxt.Append(nextChar);
            }
            return rndTxt.ToString();
        }
        /// <summary>
        /// Reads the columns from the table, and builds the internal column list.
        /// </summary>
        /// <param name="con">SqlConnection which is used to provide the connection string to be used
        /// tableName">The table we want the columns from</param>
        /// <returns>List of column information objects for the table</returns>
        private List<ColumnInfo> ReadColumns(SqlConnection con, string tableName)
        {
            List<ColumnInfo> results = null;
            // using SqlConnection, lets the framework do the clean up
            using (SqlConnection con1 = new SqlConnection(con.ConnectionString))
            {
                con1.Open();
                // SQL Statement, order by column ordinal, keeps the columns in order
                using (SqlCommand cmd = new SqlCommand(
@"select column_name, data_type, character_maximum_Length, ORDINAL_POSITION " +
"from INFORMATION_SCHEMA.COLUMNS where table_name = @1 ORDER BY ORDINAL_POSITION", con1))
                {
                    // Bind the table name into the statement
                    SqlParameter param1 = new SqlParameter("1", tableName);
                    cmd.Parameters.Add(param1);
                    using (SqlDataReader rdr = cmd.ExecuteReader())
                    {
                        while (rdr.Read())
                        {
                            if (results == null)
                                results = new List<ColumnInfo>();
                            string columnName = rdr.GetString(0);
                            string columnType = rdr.GetString(1);
                            switch (columnType)
                            {
                                case "numeric":
                                    results.Add(new ColumnInfo(columnName, columnType, 0));
                                    break;
                                case "nvarchar":
                                    int columnLength = rdr.GetInt32(2);
                                    results.Add(new ColumnInfo(columnName, columnType, columnLength));
                                    break;
                                default:
                                    break;
                            }
                        }
                    }
                    return results;
                }
            }
        }
        /// <summary>
        /// Reads the control file and pulls out the number of rows to write
        /// </summary>
        private void ReadControlFile()
        {
            XDocument xDoc = XDocument.Load(controlFile);
            string rows = (from a in xDoc.Elements("Control").Elements("FileSize").Attributes("Rows")
                            select a.Value).First();
            RowsRequired = Convert.ToInt32(rows);
        }
    }

Conclusions

This routine is “dumb as a box of hammers”, and only designed to generate random data to be loaded to Sql Server. But, for a quick set of data to test with it does the job.

I hope you find this useful. Like much of what I post here, some assembly is required. But copy and paste coding, is far simpler (and quicker) than writing something from scratch. Please feel free to extend this shell in any way you like.

, , , , , , ,

  1. #1 by aussiecraig on February 19, 2011 - 6:14 pm

    For those who have read this far, an apology.
    I forgot to post the XML control file. If you read LINQ to XML you probably can create one.
    For my sin omission, I apologies, again. As an act of contrition, I now post the xml file.

    The XML control file is:


    <?xml version="1.0" encoding="utf-8" ?>
    <Control>
    <FileSize Rows="50"/>
    </Control>

    Craig

Leave a comment