Unicode support?

Post by un_pogaz » 05 Feb 2019, 11:06

LuaMacros is powerful, and I'm sure I could do a lot with it, but I have one big complaint:
It does not support Unicode.

Unicode, and especially UTF-8, is now so common that there is no excuse for not supporting it (see this article from 2003; the idea and the problem apply everywhere).
I concede that modifying "lmc_send_keys()" would be too complicated, maybe impossible, given its whole system of keystroke sequences, key names and special characters (which have no escape, so if you want to write a literal '(', you can't).
So it would be best to create a new function, "lmc_write_text()". Besides being Unicode compatible, this function would only be used to write/paste plain text, so no special characters (other than the classic '\n', '\r', '\t'...).
That way, executing keys with "lmc_send_keys()" would be separated from writing text with "lmc_write_text()" (one is verbose and has its own vocabulary; the other is readable and simple, and would solve any accented-character problem).

Thanks for reading, and for this software.

Post by admin » 05 Feb 2019, 21:03

Yes, the lmc_send_keys implementation is quite old and has bugs.
If you need more control, use lmc_send_input (viewtopic.php?f=12&t=475), which is more complicated to use (more code is required) but has more features and should support Unicode.
I don't currently plan to extend lmc_send_keys.
Petr Medek
LUAmacros author

Post by un_pogaz » 06 Feb 2019, 10:38

Unfortunately, this does not work.

The problem is that you are trying to simulate keyboard typing, and I don't think that is the right solution.
What happens if the keyboard has been remapped (not the default layout)? Or if we want to write characters that are not present on the keyboard?
There are many variables and unknowns, and the result is unpredictable: nothing, anything, and a lot of bugs.

Really, I think that creating a new function dedicated to writing text would be "simpler" and would solve all these problems more easily.
lmc_write_text() would not be an extension of lmc_send_keys(); it would paste the given text directly (like Ctrl+V), with no keyboard typing simulation.
(You could probably try using the clipboard.)

Post by admin » 06 Feb 2019, 11:00

I don't see a technical solution for how to "paste" or "inject" text into an application.

And no, I don't plan to investigate the possibilities or extend LuaMacros with this functionality.

Post by un_pogaz » 11 Feb 2019, 16:39

Okay, I found a workaround.
A large number of characters can be written using Alt codes.
Not all of Unicode, but enough that it stops being a big problem.

function write_altcode(altcode)
	lmc_send_input(18, 0, 0);            -- press ALT (virtual-key code 18)
	lmc_send_keys(altcode, 10);          -- type the Alt code digits
	lmc_sleep(string.len(altcode) * 10); -- wait until all characters of the Alt code have been sent
	lmc_send_input(18, 0, 2);            -- release ALT
end;

Post by izosimovmp » 28 Feb 2019, 08:24

Please post a complete code example that outputs any Unicode character.

Post by admin » 01 Mar 2019, 17:17

I put the reply in another thread you have created: http://hidmacros.eu/forum/viewtopic.php ... 5055#p5055

Post by un_pogaz » 12 Mar 2019, 11:46

I found something better than the Alt code!

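function write_text(text)
  if (text == nil) then text = "" end;
  
  local tbl = utf8_explode(tostring(text));
  -- utf8_explode() returns nil for invalid UTF-8, so check before indexing
  if (tbl ~= nil and tbl.len > 0) then
    -- ipairs keeps the code points in string order (pairs does not guarantee it)
    for i, c in ipairs(tbl.codepoints) do
      lmc_send_input(0, c, 4) -- press
      lmc_send_input(0, c, 6) -- release
    end;
  end;
end;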

--[[ utf8_explode / unicode compatibility
 extract from ustring.lua
 https://github.com/wikimedia/mediawiki-extensions-Scribunto/blob/master/includes/engines/LuaCommon/lualib/ustring/ustring.lua

 A private helper that splits a string into codepoints, and also collects the
 starting position of each character and the total length in codepoints.

 @param s string  utf8-encoded string to decode
 @return table { .len, .codepoints, .bytepos}
]]

function utf8_explode( s )
  local rslt = {
    len = 0,
    codepoints = {},
    bytepos = {},
  }

  local i = 1
  local l = string.len( s )
  local cp, b, b2, trail
  local min
  while i <= l do
    b = string.byte( s, i )
    if b < 0x80 then
      -- 1-byte code point, 00-7F
      cp = b
      trail = 0
      min = 0
    elseif b < 0xc2 then
      -- Either a non-initial code point (invalid here) or
      -- an overlong encoding for a 1-byte code point
      return nil
    elseif b < 0xe0 then
      -- 2-byte code point, C2-DF
      trail = 1
      cp = b - 0xc0
      min = 0x80
    elseif b < 0xf0 then
      -- 3-byte code point, E0-EF
      trail = 2
      cp = b - 0xe0
      min = 0x800
    elseif b < 0xf4 then
      -- 4-byte code point, F0-F3
      trail = 3
      cp = b - 0xf0
      min = 0x10000
    elseif b == 0xf4 then
      -- 4-byte code point, F4
      -- Make sure it doesn't decode to over U+10FFFF
      -- A string that ends right after F4 has no next byte; treat it as invalid
      b2 = string.byte( s, i + 1 )
      if not b2 or b2 > 0x8f then
        return nil
      end
      trail = 3
      cp = 4
      min = 0x100000
    else
      -- Code point over U+10FFFF, or invalid byte
      return nil
    end

    -- Check subsequent bytes for multibyte code points
    for j = i + 1, i + trail do
      b = string.byte( s, j )
      if not b or b < 0x80 or b > 0xbf then
        return nil
      end
      cp = cp * 0x40 + b - 0x80
    end
    if cp < min then
      -- Overlong encoding
      return nil
    end

    rslt.codepoints[#rslt.codepoints + 1] = cp
    rslt.bytepos[#rslt.bytepos + 1] = i
    rslt.len = rslt.len + 1
    i = i + 1 + trail
  end

  -- Two past the end (for sub with empty string)
  rslt.bytepos[#rslt.bytepos + 1] = l + 1
  rslt.bytepos[#rslt.bytepos + 1] = l + 1

  return rslt;
end;

The write_text() function can write any Unicode character string.
It writes each character from its code point, taken from the table returned by utf8_explode().
I found utf8_explode() here (ustring.lua). That file has many functions that are useless to me, so I extracted only the one I need.

OK, there is a bug: only characters of the BMP (Basic Multilingual Plane) can be written... which still covers the 65,536 most "common" code points! (So no emoji, sorry :cry: )
Apparently this comes from lmc_send_input(), which does not accept values greater than 65535 and wraps back to 0 (integer overflow).

PS: Don't forget to save your Lua script as "UTF-8 (no BOM)". I recommend Notepad++ to do this easily.
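
For anyone who wants to see what write_text() actually sends without running LuaMacros, here is a minimal sketch in plain Lua. Everything in it is an assumption made for illustration: fake_send_input is a hypothetical stub that records calls instead of typing, decode_utf8 is a hypothetical, much smaller decoder than utf8_explode() (it does no validation, so feed it only well-formed UTF-8), and reading the values 4 and 6 as the Windows flags KEYEVENTF_UNICODE and KEYEVENTF_UNICODE + KEYEVENTF_KEYUP is my interpretation of the press/release comments above.

```lua
-- Hypothetical stand-in for lmc_send_input(): records calls instead of typing.
sent = {}
function fake_send_input(vk, scan, flags)
  sent[#sent + 1] = { vk = vk, scan = scan, flags = flags }
end

-- Minimal UTF-8 decoder for well-formed input only (no validation,
-- unlike utf8_explode); returns an array of code points.
function decode_utf8(s)
  local cps, i = {}, 1
  while i <= #s do
    local b = string.byte(s, i)
    local cp, trail
    if b < 0x80 then cp, trail = b, 0            -- 1-byte sequence
    elseif b < 0xE0 then cp, trail = b - 0xC0, 1 -- 2-byte sequence
    elseif b < 0xF0 then cp, trail = b - 0xE0, 2 -- 3-byte sequence
    else cp, trail = b - 0xF0, 3 end             -- 4-byte sequence
    for j = i + 1, i + trail do
      cp = cp * 0x40 + (string.byte(s, j) - 0x80)
    end
    cps[#cps + 1] = cp
    i = i + 1 + trail
  end
  return cps
end

-- Same press/release pattern as write_text():
-- 4 is assumed to be KEYEVENTF_UNICODE, 6 adds KEYEVENTF_KEYUP.
function sketch_write_text(text)
  for _, cp in ipairs(decode_utf8(text)) do
    fake_send_input(0, cp, 4) -- press
    fake_send_input(0, cp, 6) -- release
  end
end

-- "h", "é" (C3 A9) and "€" (E2 82 AC), written as explicit UTF-8 bytes
sketch_write_text("h\195\169\226\130\172")
for _, e in ipairs(sent) do
  print(string.format("scan=U+%04X flags=%d", e.scan, e.flags))
end
```

Each character produces one press and one release, with the code point passed as the second argument. That only works while the code point fits in 16 bits, which is where the BMP limit mentioned above comes from.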

Post by admin » 13 Mar 2019, 09:53

un_pogaz wrote:
12 Mar 2019, 11:46
OK, there is a bug: only characters of the BMP (Basic Multilingual Plane) can be written... which still covers the 65,536 most "common" code points! (So no emoji, sorry :cry: )
Apparently this comes from lmc_send_input(), which does not accept values greater than 65535 and wraps back to 0 (integer overflow).

The 65535 limit comes from the Windows API function, which accepts only a 16-bit WORD for this parameter (see MSDN page).
After a bit of googling I found this answer, which says such characters should be sent using two consecutive SendInput calls with a surrogate pair. As lmc_send_input is just a wrapper around SendInput, you may try consecutive calls of lmc_send_input.
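
As a rough illustration of the surrogate pair idea, the two 16-bit values can be computed with plain arithmetic. This is only a sketch of the standard UTF-16 rule; to_surrogates is a hypothetical helper name, not part of LuaMacros, and whether two consecutive lmc_send_input calls are then accepted by the target application is not something this snippet can show.

```lua
-- Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
-- into its UTF-16 high and low surrogates.
function to_surrogates(cp)
  assert(cp >= 0x10000 and cp <= 0x10FFFF, "not a supplementary code point")
  local v = cp - 0x10000                       -- 20-bit value
  local high = 0xD800 + math.floor(v / 0x400)  -- top 10 bits
  local low  = 0xDC00 + v % 0x400              -- bottom 10 bits
  return high, low
end

local high, low = to_surrogates(0x1F600)       -- U+1F600, an emoji
print(string.format("%04X %04X", high, low))   --> D83D DE00
```

Both results fit in 16 bits, so each one could in principle be passed to lmc_send_input in place of a BMP code point.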

Post by un_pogaz » 13 Mar 2019, 10:31

In other words:
lmc_send_input() sends/writes one UTF-16 "character" (code unit)...
Damn, it's going to be long and complicated before we get a fully Unicode-compatible write_text() function.
But I am beginning to see the end of it, and it clearly doesn't seem impossible.

Thank you for your answers :D

EDIT: Ugh, it doesn't seem to work :|
I found on this page how to "create" the surrogate pairs, but when executed, the function writes 2 characters. The surrogate values are correct, so I'm probably missing something else (execution order or dwFlags value).

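function mp_write_text(text)
	if (text == nil) then text = "" end;
	local tbl = utf8_explode(tostring(text));
	-- utf8_explode() returns nil for invalid UTF-8, so check before indexing
	if (tbl ~= nil and tbl.len > 0) then
		-- ipairs keeps the code points in string order
		for i, c in ipairs(tbl.codepoints) do
			mp_unicode_write(c);
		end;
	end;
end;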

function mp_unicode_write(codepoint)
	if (codepoint == nil) then codepoint = "" end;
	codepoint = tonumber(codepoint);
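	-- reject non-numbers, negatives, lone surrogates and values above U+10FFFF (U+10FFFF itself is valid)
	if (codepoint == nil or codepoint < 0 or (codepoint >= 0xd800 and codepoint <= 0xdfff) or codepoint > 0x10ffff) then return end;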
	
	if (codepoint < 0x10000) then
		lmc_send_input(0, codepoint, 4); -- press
		lmc_send_input(0, codepoint, 6); -- release
	else
		local utf32 = toBits(codepoint, 32)
		print(utf32)
		print("")
		local w = toBits(tonumber(string.sub(utf32, 1, 16), 2) - 1, 4);
		local x = string.sub(utf32, 17, 22);
		local y = string.sub(utf32, 23, 32);
		
		print("110110" .. w .. x)
		print("110111" .. y)
		
		lmc_send_input(0, tonumber("110110" .. w .. x, 2), 4)
		lmc_send_input(0, tonumber("110111" .. y, 2), 4)
		lmc_send_input(0, tonumber("110110" .. w .. x, 2), 6)
		lmc_send_input(0, tonumber("110111" .. y, 2), 6)
	end;
end;

function toBits(num, bits)
	-- returns a string of bits ('0'/'1'), most significant first.
	bits = bits or math.max(1, select(2, math.frexp(num)))
	local t = {} -- will contain the bits        
	for b = bits, 1, -1 do
		t[b] = math.fmod(num, 2)
		num = math.floor((num - t[b]) / 2)
	end
	return table.concat(t)
end

--[[ utf8_explode / extract from ustring.lua
 https://github.com/wikimedia/mediawiki-extensions-Scribunto/blob/master/includes/engines/LuaCommon/lualib/ustring/ustring.lua

 A private helper that splits a string into codepoints, and also collects the
 starting position of each character and the total length in codepoints.

 @param s string utf8-encoded to decode
 @return table { .len, .codepoints, .bytepos}
]]

function utf8_explode( s )
	local rslt = {
		len = 0,
		codepoints = {},
		bytepos = {},
	};

	local i = 1;
	local l = string.len( s );
	local cp, b, b2, trail;
	local min;
	while i <= l do
		b = string.byte( s, i );
		if b < 0x80 then
			-- 1-byte code point, 00-7F
			cp = b;
			trail = 0;
			min = 0;
		elseif b < 0xc2 then
			-- Either a non-initial code point (invalid here) or
			-- an overlong encoding for a 1-byte code point
			return nil;
		elseif b < 0xe0 then
			-- 2-byte code point, C2-DF
			trail = 1;
			cp = b - 0xc0;
			min = 0x80;
		elseif b < 0xf0 then
			-- 3-byte code point, E0-EF
			trail = 2;
			cp = b - 0xe0;
			min = 0x800;
		elseif b < 0xf4 then
			-- 4-byte code point, F0-F3
			trail = 3;
			cp = b - 0xf0;
			min = 0x10000;
		elseif b == 0xf4 then
			-- 4-byte code point, F4
			-- Make sure it doesn't decode to over U+10FFFF
			-- (a string that ends right after F4 has no next byte; treat it as invalid)
			b2 = string.byte( s, i + 1 );
			if not b2 or b2 > 0x8f then
				return nil;
			end
			trail = 3;
			cp = 4;
			min = 0x100000;
		else
			-- Code point over U+10FFFF, or invalid byte
			return nil;
		end

		-- Check subsequent bytes for multibyte code points
		for j = i + 1, i + trail do
			b = string.byte( s, j );
			if not b or b < 0x80 or b > 0xbf then
				return nil;
			end;
			cp = cp * 0x40 + b - 0x80;
		end;
		if cp < min then
			-- Overlong encoding
			return nil;
		end;

		rslt.codepoints[#rslt.codepoints + 1] = cp;
		rslt.bytepos[#rslt.bytepos + 1] = i;
		rslt.len = rslt.len + 1;
		i = i + 1 + trail;
	end;
	
	-- Two past the end (for sub with empty string)
	rslt.bytepos[#rslt.bytepos + 1] = l + 1;
	rslt.bytepos[#rslt.bytepos + 1] = l + 1;
	
	return rslt;
end;