Building
a Random Character Value Generator
Oracle
Database 10g provides a random value generator via
a package named DBMS_RANDOM,
which can be used to generate random values. However, those
values are arbitrary, not representative of the real-life system.
For instance, the following an example of how you can generate
a string 30 characters long:
SQL>
select dbms_random.string('P',30) from dual;
DBMS_RANDOM.STRING('P',30)
-----------------------------------------------------
O=*KXh}#O`D~1|'^VheWr:jZ>!:P}3
The
package DBMS_RANDOM
has several functions and procedures, one of which is STRING(),
which accepts two parameters and returns a character value.
The types of characters generated are determined by the first
parameter — OPTION.
Here are the options and the types of string generated:
| Option |
Type
of String Produced |
| U |
Any
uppercase alphabetic characters |
| L |
Any
lowercase alphabetic characters |
| A |
Any
mixed case alphabetic characters |
| X |
Any
alphanumeric character in upper case |
| P |
Any
printable character |
You
can also use lowercase characters for the options; e.g., x instead
of X. In this example, we used X, which produced a string of
printable characters, not just alphabetic ones. Clearly, this
kind of character string will not represent a customer name.
Using an option such as “A” would generate a more
appropriate string.
Populating
a Table
Let’s
start with a real-life example. In the case of Acme Bank’s
application development, the team is mulling over the proper
indexing, partitioning, physical layout, materialized view planning,
and so on, for the ACCOUNTS table, which holds the customers’
account data. Here is how the table looks :
SQL> desc accounts
Name Null? Type
----------------- -------- ------------
ACC_NO NOT NULL NUMBER
FIRST_NAME NOT NULL VARCHAR2(30)
LAST_NAME NOT NULL VARCHAR2(30)
ACC_TYPE NOT NULL VARCHAR2(1)
FOLIO_ID NUMBER
SUB_ACC_TYPE VARCHAR2(30)
ACC_OPEN_DT NOT NULL DATE
ACC_MOD_DT DATE
ACC_MGR_ID NUMBER
The
columns are fairly self-explanatory. To represent customers
in the real world, the following requirements have been specified
for the columns:
Column
Name |
Purpose |
Data
Pattern |
ACC_NO
|
Account
Number |
Any
number less than 10 digits |
FIRST_NAME
|
The
first name |
-
10%
Alan
-
10%
Barbara
-
5%
Charles
-
5%
David
-
15%
Ellen
-
20%
Frank
-
10%
George
-
5%
Hillary
-
10%
Iris
-
10%
Josh
|
LAST_NAME |
The
last name |
Any
alphabetic character between four and 30 but 25 percent
should be “Smith” |
ACC_TYPE |
The
type of account — Savings, Checking, and so on |
20
percent each of S, C, M, D, and X |
FOLIO_ID |
The
folio ID from the other systems |
Half
NULL and the rest half a number related to the account
number |
SUB_ACC_TYPE |
If
the customer is incorporated, then sub-account types,
if any 75 percent null |
From
the values populated:
|
ACC_OPEN_DT |
Date
account was opened |
A
date between now and 500 days ago |
ACC_MGR_ID |
The
ID of the account manager servicing the account |
There
are five account managers, with account percentages distributed
as follows:
-
1
— 40 percent
-
2
— 10 percent
-
3
— 10 percent
-
4
—10 percent
-
5
— 30 percent
|
As
you can see, some fairly complex requirements were specified,
but for good reasons. These accurately reflect how the data
will be distributed in real life. In real life, there will be
customers with first names like “Josh” and “Ellen,”
not “XepqjEuF”; so, the names must be chosen from
the set of possible names. And, in the U.S., people are called
by a variety of last names. Thus, we want a semi-random distribution
with 25 percent of a very popular last name, “Smith.”
Generating
Random Numbers
Before
we go any further, we have to discuss how to generate random
numbers. The package dbms_random
contains a function to return random numbers. If you want to
return any random number between -2^31 and +2^31, use the function
RANDOM. The following shows how we have generated numbers 10
times:
SQL> begin
2 for i in 1..10 loop
3 dbms_output.put_line('Random Number='||dbms_random.random);
4 end loop;
5 end;
6 /
Random Number=-81420432
Random Number=-1024262734
Random Number=-1965250926
Random Number=1439118604
Random Number=675429938
Random Number=-466227661
Random Number=613708106
Random Number=-1154140330
Random Number=-643127572
Random Number=-133140229
This
function is available but deprecated. Oracle recommends using
a new function called VALUE().
SQL> l
1 begin
2 for i in 1..10 loop
3 dbms_output.put_line('Random Number='||dbms_random.value);
4 end loop;
5* end;
SQL> /
Random Number=.547109841457281046373693994862305187
Random Number=.00500720861223234783817674992068380455
Random Number=.86974675431616311272549579082479240362
Random Number=.16306119784529083761710557238498944243
Random Number=.39327329046753189206427695323437381763
Random Number=.68307381430584611139249432690613072007
Random Number=.24640768414299435941101562183729221882
Random Number=.50568180275705934132098716939122047439
Random Number=.73345513921455391594299189253661168712
Random Number=.71764049051903979360796313613342367114
Note
the difference; VALUE returns a positive random decimal number
less than one, with 38 digits after the decimal point. In most
cases, this is probably enough, with some modification (if you
are looking for a 10-digit whole number, for instance, simply
multiply it by 10,000,000,000 and use FLOOR() to discard the
decimals).
1 begin
2 for i in 1..10 loop
3 dbms_output.put_line('Random Number='||
4 floor (10000000000 * dbms_random.value)
5 );
6 end loop;
7* end;
SQL> /
Random Number=939084911
Random Number=690705371
Random Number=4696700513
Random Number=7978266084
Random Number=5157885833
Random Number=8902042948
Random Number=5839885968
Random Number=6207324613
Random Number=5633096626
Random Number=1891871746
The
function VALUE() is overloaded; another variation of the functions
accepts a set of low and high values and returns a number between
them. To generate a number between 10 and 20, for instance,
you will use
dbms_random.value
(10, 20)
Note:
The numbers generated will be more than or equal to 10, but
less than (never equal to) 20.
Using
it to generate multiple values, we get
1 begin
2 for i in 1..10 loop
3 dbms_output.put_line('Random Number='||
4 dbms_random.value (10,20)
5 );
6 end loop;
7* end;
SQL> /
Random Number=11.0908840421899990054735051046498510644
Random Number=14.5391525077977177619575126573478641174
Random Number=11.1343623028437386191996545034616142284
Random Number=14.8831724931888089801812805693409810454
Random Number=11.7294096895635532492353976105129628219
Random Number=15.0589018451126293465505440533299716691
Random Number=15.6152418028444556585301007183814067306
Random Number=18.2903839241741813303152060555229669855
Random Number=13.0914352695913402037626709927051618367
Random Number=12.6027933168087755621076348461280998683
As
you can see, the numbers are generated with 37 digits after
the decimal point. This is a very valuable function and will
suffice for lmost types of random number generation. If you
want to generate only whole numbers, use the same approach used
earlier, using the FLOOR() function.
floor(dbms_random.value
(10,20))
Generating
Specific Characters and Strings
So,
how can we generate specific strings that follow a predetermined
statistical pattern?
We
can borrow a page from the probability theory text to accomplish
this. The trick is to use a method similar to Monte-Carlo simulation
used by statisticians worldwide. In this approach, we generate
a random number, between one and 100 (both inclusive). Over
a period of time, the probability that a specific number, say
six, will turn up is exactly one time out of 100, or 1 percent
of the time. In fact, all the numbers have 1/100 probability.
Going by the same approach, the probability that either of two
numbers — say, 1 and 2 — will be 2 percent. And,
of course, the probability that any one of numbers between one
and 10 will turn up is 10 percent. We will use this to configure
the probability of the random value.
Take,
for instance, the value of the column ACC_TYPE,
which calls for equal probability of S, C, M, D and X; or 20
percent probability each. If we generate a whole number between
one and five (both inclusive), the probability of each number
will be 20 percent. Then we can use a DECODE()
function to get the ACC_TYPE
value.
SQL> select
2 decode (
3 floor (
4 dbms_random.value (1,6)
5 ),
6 1,'S',
7 2,'C',
8 3,'M',
9 4,'D',
10 'X'
11 )
12 from dual;
First,
we are generating a number between one and five (line 4). Since
the number is generated is less than the highest value passed
as a parameter, we have specified six. And since we want a whole
number, we have used the FLOOR()
function in line 3. It truncates all decimal values from the
generated number. Depending on the number obtained, we used
DECODE() to get
one of the values — S, C, M, D, or X. Since the numbers
1, 2, 3, 4, and 5 will have equal probability of being generated,
so will be the letters — at 20 percent each.
This
technique is very valuable for generating random, but useful
values, as shown previously. The same approach can be used to
generate almost all types of pre-determined random values.
Random
Values with NULLs
Remember,
the requirement for FOLIO_ID
is a little different. It needs only 50 percent of the values
populated; the rest should be NULL.
How can we achieve this?
Quite
simply, we will use the same probability approach with a twist:
we will use a determination of yes or no. Generating a random
number between one and 100 will ensure 1 percent probability
of each number. Hence, a number less than 51 will have exactly
50 percent probability of occurring. We can use this in a CASE
statement to get the value.
SQL> select
2 case
3 when dbms_random.value (1,100) < 51 then null
4 else
5 floor(dbms_random.value(1,100))
6 end
7 from dual;
On
line 3, we should check to see whether the number generated
is less than 51. If so, we return NULL.
Since the probability of a sub-51 number is 50 percent , we
have NULLs occurring 50 percent of the time as well. In the
other 50 percent of the time, we have generated a value to be
used as a FOLIO_ID.
Random
Strings of Random Length
In
dbms_random.string,
a random string is generated, but of fixed length. That is not
representative of real life; in reality, people have last names
of varying lengths. In this example, the requirement is to have
a length between four and 30 characters. To facilitate this,
we can pass the length as a random number as well to the function
dbms_random.string
in line 6 below.
1 begin
2 for i in 1..10 loop
3 dbms_output.put_line('Random String='||
4 dbms_random.string (
5 'A',
6 dbms_random.value(4,30)
7 )
8 );
9 end loop;
10* end;
SQL> /
Random String=RniQZGquFVJYFpGLOvtNd
Random String=GhcphpcsaCXlhigRQY
Random String=JtakoelUf
Random String=BgCOu
Random String=QFBzQxcHqGlHWkZFmnN
Random String=lSxVjqJvpwBB
Random String=jfhNARzALrLOKZRpOwnhrzz
Random String=KuFtdJcqQpjkrFmzFbzcXnYFGjWo
Random String=BhuZ
Random String=GebcqcgvzBfEpTYnJPmYAQdb
Notice
that the strings are of different lengths. Remember, 25 percent
of the last names must be “Smith,” and the rest
must have random lengths. We can accomplish this by combining
the random strings and the Monte-Carlo approach:
decode (
floor(dbms_random.value(1,5)),
1,'Smith',
dbms_random.string ('A',dbms_random.value(4,30))
)
The
previous expression will return “Smith” 25 percent
of the time and a random alphabetic string between four and
30 characters long the rest of the time.
Putting
it All Together
Now
that you understood the building blocks of the randomization
approach, you can put them together to build the account record
generation PL/SQL block as shown below. In the following example,
we are loading 100,000 records into the table ACCOUNTS.
Here is the loading program in full:
begin
for l_acc_no in 1..100000 loop
insert into accounts
values
(
l_acc_no,
-- First Name
decode (
floor(dbms_random.value (1,21)),
1, 'Alan',
2, 'Alan',
3, 'Barbara',
4, 'Barbara',
5, 'Charles',
6, 'David',
7, 'Ellen',
8, 'Ellen',
9, 'Ellen',
10, 'Frank',
11, 'Frank',
12, 'Frank',
13, 'George',
14, 'George',
15, 'George',
16, 'Hillary',
17, 'Iris',
18, 'Iris',
19,'Josh',
20,'Josh',
'XXX'
),
-- Last Name
decode (
floor(dbms_random.value(1,5)),
1,'Smith',
dbms_random.string ('A',dbms_random.value(4,30))
),
-- Account Type
decode (
floor(dbms_random.value (1,5)),
1,'S',2,'C',3,'M',4,'D','X'
),
-- Folio ID
case
when dbms_random.value (1,100) < 51 then null
else
l_acc_no + floor(dbms_random.value(1,100))
end,
-- Sub Acc Type
case
when dbms_random.value (1,100) < 76 then null
else
decode (floor(dbms_random.value (1,6)),
1,'S',2,'C',3,'C',4,'C',5,'C',null)
end,
-- Acc Opening Date
sysdate - dbms_random.value(1,500),
-- Account Manager ID
decode (
floor(dbms_random.value (1,11)),
1,1,2,1,3,1,4,1,5,2,6,3,7,4,8,5,9,5,10,5,0
)
);
end loop;
commit;
end;
Now,
how do we know that all these exercise yielded fruit? After
this table is loaded, let’s see the actual distribution:
SQL>
select first_name, count(*) from accounts group by first_name;
FIRST_NAME COUNT(*)
------------------------------ ----------
Alan 9834
Barbara 10224
Charles 5046
David 4980
Ellen 15094
Frank 14960
George 14890
Hillary 4898
Iris